Project Title: Breast Cancer Diagnostic Prediction¶
Dataset Link: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download¶
Description of Dataset:¶
The Breast Cancer Wisconsin (Diagnostic) Data Set is used to predict whether a tumor is malignant (cancerous) or benign (non-cancerous) based on features that describe characteristics of the cell nuclei. These features, included in the data set, are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
Number of Instances: 569
Variables:
- id (ID): ID number
- diagnosis (TARGET): The diagnosis of breast tissues (M = malignant, B = benign)
- radius_mean: The mean of distances from center to points on the perimeter
- texture_mean: The standard deviation of gray-scale values
- perimeter_mean: The mean size of the core tumor
- area_mean: The average area of the nucleus
- smoothness_mean: The mean of local variation in radius lengths
- compactness_mean: The mean of perimeter^2 / area - 1.0 of the nucleus shape
- concavity_mean: The mean of severity of concave portions of the contour
- concave points_mean: The mean for number of concave portions of the contour
- symmetry_mean: The mean for symmetry of the nucleus shape
- fractal_dimension_mean: The mean for "coastline approximation" - 1
- radius_se: Standard error of radius_mean
- texture_se: Standard error of the texture_mean
- perimeter_se: Standard error of the perimeter_mean
- area_se: Standard error of the area_mean
- smoothness_se: Standard error of the smoothness_mean
- compactness_se: Standard error of the compactness_mean
- concavity_se: Standard error of the concavity_mean
- concave points_se: Standard error of the concave points_mean
- symmetry_se: Standard error of the symmetry_mean
- fractal_dimension_se: Standard error of the fractal_dimension_mean
- radius_worst: "Worst" or largest mean value for radius_mean
- texture_worst: "Worst" or largest mean value for texture_mean
- perimeter_worst: "Worst" or largest mean value for perimeter_mean
- area_worst: "Worst" or largest mean value for area_mean
- smoothness_worst: "Worst" or largest mean value for smoothness_mean
- compactness_worst: "Worst" or largest mean value for compactness_mean
- concavity_worst: "Worst" or largest mean value for concavity_mean
- concave points_worst: "Worst" or largest mean value for concave points_mean
- symmetry_worst: "Worst" or largest mean value for symmetry_mean
- fractal_dimension_worst: "Worst" or largest mean value for fractal_dimension_mean
In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as stats
from scipy.stats import skew, norm, probplot
from matplotlib.pyplot import boxplot
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Load the dataset
df = pd.read_csv("data.csv")
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Data Cleaning¶
In [4]:
# Cleaning Data
# Removed two columns, the empty 'Unnamed: 32' and the 'id' column, because neither affects the diagnosis variable
del df['Unnamed: 32']
del df['id']
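In [4a]:
An alternative to `del` is `DataFrame.drop`, which returns a new frame and can skip columns that are already absent. A minimal sketch on a small hypothetical frame mimicking the raw CSV (the toy values are not from the dataset):

```python
import pandas as pd

# Hypothetical frame standing in for the raw CSV's extra columns
toy = pd.DataFrame({'id': [1, 2],
                    'radius_mean': [14.1, 20.5],
                    'Unnamed: 32': [None, None]})

# drop(columns=...) returns a new DataFrame; errors='ignore' makes the
# call safe to re-run even after the columns are gone
cleaned = toy.drop(columns=['id', 'Unnamed: 32'], errors='ignore')
print(list(cleaned.columns))  # ['radius_mean']
```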
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   diagnosis                569 non-null    object
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
In [6]:
# Convert 'diagnosis' (TARGET) column: M → 1, B → 0
# It will be easier to use for models
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
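In [6a]:
Since `.map` silently produces NaN for any label outside the dictionary, a quick check after encoding can catch typos or unexpected categories before modeling. A sketch on a hypothetical miniature series standing in for the real 'diagnosis' column:

```python
import pandas as pd

# Miniature stand-in for the 'diagnosis' column (values are illustrative)
s = pd.Series(['M', 'B', 'B', 'M'])
encoded = s.map({'M': 1, 'B': 0})

# Any label not in the dictionary would become NaN, so assert none appeared
assert encoded.isna().sum() == 0
print(encoded.tolist())  # [1, 0, 0, 1]
```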
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   diagnosis                569 non-null    int64
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
In [8]:
df.head()
Out[8]:
| diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 1 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 1 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 31 columns
In [9]:
df.tail()
Out[9]:
| diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 1 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 1 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 1 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 1 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |
| 568 | 0 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | 0.1587 | ... | 9.456 | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 |
5 rows × 31 columns
Descriptive Statistics¶
In [10]:
# Calculated to gain insight into the data distribution
df.mean()
Out[10]:
diagnosis                    0.372583
radius_mean                 14.127292
texture_mean                19.289649
perimeter_mean              91.969033
area_mean                  654.889104
smoothness_mean              0.096360
compactness_mean             0.104341
concavity_mean               0.088799
concave points_mean          0.048919
symmetry_mean                0.181162
fractal_dimension_mean       0.062798
radius_se                    0.405172
texture_se                   1.216853
perimeter_se                 2.866059
area_se                     40.337079
smoothness_se                0.007041
compactness_se               0.025478
concavity_se                 0.031894
concave points_se            0.011796
symmetry_se                  0.020542
fractal_dimension_se         0.003795
radius_worst                16.269190
texture_worst               25.677223
perimeter_worst            107.261213
area_worst                 880.583128
smoothness_worst             0.132369
compactness_worst            0.254265
concavity_worst              0.272188
concave points_worst         0.114606
symmetry_worst               0.290076
fractal_dimension_worst      0.083946
dtype: float64
In [11]:
df.median()
Out[11]:
diagnosis                   0.000000
radius_mean                13.370000
texture_mean               18.840000
perimeter_mean             86.240000
area_mean                 551.100000
smoothness_mean             0.095870
compactness_mean            0.092630
concavity_mean              0.061540
concave points_mean         0.033500
symmetry_mean               0.179200
fractal_dimension_mean      0.061540
radius_se                   0.324200
texture_se                  1.108000
perimeter_se                2.287000
area_se                    24.530000
smoothness_se               0.006380
compactness_se              0.020450
concavity_se                0.025890
concave points_se           0.010930
symmetry_se                 0.018730
fractal_dimension_se        0.003187
radius_worst               14.970000
texture_worst              25.410000
perimeter_worst            97.660000
area_worst                686.500000
smoothness_worst            0.131300
compactness_worst           0.211900
concavity_worst             0.226700
concave points_worst        0.099930
symmetry_worst              0.282200
fractal_dimension_worst     0.080040
dtype: float64
In [12]:
df.mode()
Out[12]:
| diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 12.34 | 14.93 | 82.61 | 512.2 | 0.1007 | 0.1147 | 0.0 | 0.0 | 0.1601 | ... | 12.36 | 17.70 | 101.7 | 284.4 | 0.1216 | 0.1486 | 0.0 | 0.0 | 0.2226 | 0.07427 |
| 1 | NaN | NaN | 15.70 | 87.76 | NaN | NaN | 0.1206 | NaN | NaN | 0.1714 | ... | NaN | 27.26 | 105.9 | 402.8 | 0.1223 | 0.3416 | NaN | NaN | 0.2369 | NaN |
| 2 | NaN | NaN | 16.84 | 134.70 | NaN | NaN | NaN | NaN | NaN | 0.1717 | ... | NaN | NaN | 117.7 | 439.6 | 0.1234 | NaN | NaN | NaN | 0.2383 | NaN |
| 3 | NaN | NaN | 16.85 | NaN | NaN | NaN | NaN | NaN | NaN | 0.1769 | ... | NaN | NaN | NaN | 458.0 | 0.1256 | NaN | NaN | NaN | 0.2972 | NaN |
| 4 | NaN | NaN | 17.46 | NaN | NaN | NaN | NaN | NaN | NaN | 0.1893 | ... | NaN | NaN | NaN | 472.4 | 0.1275 | NaN | NaN | NaN | 0.3109 | NaN |
| 5 | NaN | NaN | 18.22 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 489.5 | 0.1312 | NaN | NaN | NaN | 0.3196 | NaN |
| 6 | NaN | NaN | 18.90 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 546.7 | 0.1347 | NaN | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | 19.83 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 547.4 | 0.1401 | NaN | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | 20.52 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 624.1 | 0.1415 | NaN | NaN | NaN | NaN | NaN |
| 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 698.8 | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 706.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 708.8 | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 725.9 | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 733.5 | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 749.9 | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 808.9 | NaN | NaN | NaN | NaN | NaN | NaN |
| 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 826.4 | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 830.5 | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 1210.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 19 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 1261.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 1269.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 1437.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 22 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 1603.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 1623.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 1750.0 | NaN | NaN | NaN | NaN | NaN | NaN |
25 rows × 31 columns
In [13]:
df.std()
Out[13]:
diagnosis                   0.483918
radius_mean                 3.524049
texture_mean                4.301036
perimeter_mean             24.298981
area_mean                 351.914129
smoothness_mean             0.014064
compactness_mean            0.052813
concavity_mean              0.079720
concave points_mean         0.038803
symmetry_mean               0.027414
fractal_dimension_mean      0.007060
radius_se                   0.277313
texture_se                  0.551648
perimeter_se                2.021855
area_se                    45.491006
smoothness_se               0.003003
compactness_se              0.017908
concavity_se                0.030186
concave points_se           0.006170
symmetry_se                 0.008266
fractal_dimension_se        0.002646
radius_worst                4.833242
texture_worst               6.146258
perimeter_worst            33.602542
area_worst                569.356993
smoothness_worst            0.022832
compactness_worst           0.157336
concavity_worst             0.208624
concave points_worst        0.065732
symmetry_worst              0.061867
fractal_dimension_worst     0.018061
dtype: float64
In [14]:
df.std()*3
Out[14]:
diagnosis                    1.451754
radius_mean                 10.572146
texture_mean                12.903107
perimeter_mean              72.896943
area_mean                 1055.742388
smoothness_mean              0.042192
compactness_mean             0.158438
concavity_mean               0.239159
concave points_mean          0.116409
symmetry_mean                0.082243
fractal_dimension_mean       0.021181
radius_se                    0.831938
texture_se                   1.654945
perimeter_se                 6.065564
area_se                    136.473017
smoothness_se                0.009008
compactness_se               0.053725
concavity_se                 0.090558
concave points_se            0.018511
symmetry_se                  0.024799
fractal_dimension_se         0.007938
radius_worst                14.499725
texture_worst               18.438773
perimeter_worst            100.807627
area_worst                1708.070978
smoothness_worst             0.068497
compactness_worst            0.472009
concavity_worst              0.625873
concave points_worst         0.197197
symmetry_worst               0.185602
fractal_dimension_worst      0.054184
dtype: float64
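In [14a]:
The 3× standard deviation values above correspond to a three-sigma outlier screen: values farther than three standard deviations from the mean are candidate outliers. A minimal sketch of that screen on hypothetical data (the column name is borrowed from the dataset, the values are synthetic):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one feature column, with one injected extreme value
rng = np.random.default_rng(0)
toy = pd.DataFrame({'area_mean': np.append(rng.normal(650, 350, 99), 5000.0)})

# Flag rows lying more than 3 standard deviations from the column mean
mean, std = toy['area_mean'].mean(), toy['area_mean'].std()
outliers = toy[(toy['area_mean'] - mean).abs() > 3 * std]
print(len(outliers))
```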
In [15]:
df.columns
Out[15]:
Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')
In [16]:
# Number of rows, columns
df.shape
Out[16]:
(569, 31)
In [17]:
df.describe()
Out[17]:
| diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
| mean | 0.372583 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 16.269190 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 0.483918 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 0.000000 | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
| 25% | 0.000000 | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | ... | 13.010000 | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 |
| 50% | 0.000000 | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | ... | 14.970000 | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 |
| 75% | 1.000000 | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | ... | 18.790000 | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 |
| max | 1.000000 | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | ... | 36.040000 | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 31 columns
In [18]:
df.describe().T
Out[18]:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| diagnosis | 569.0 | 0.372583 | 0.483918 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00000 |
| radius_mean | 569.0 | 14.127292 | 3.524049 | 6.981000 | 11.700000 | 13.370000 | 15.780000 | 28.11000 |
| texture_mean | 569.0 | 19.289649 | 4.301036 | 9.710000 | 16.170000 | 18.840000 | 21.800000 | 39.28000 |
| perimeter_mean | 569.0 | 91.969033 | 24.298981 | 43.790000 | 75.170000 | 86.240000 | 104.100000 | 188.50000 |
| area_mean | 569.0 | 654.889104 | 351.914129 | 143.500000 | 420.300000 | 551.100000 | 782.700000 | 2501.00000 |
| smoothness_mean | 569.0 | 0.096360 | 0.014064 | 0.052630 | 0.086370 | 0.095870 | 0.105300 | 0.16340 |
| compactness_mean | 569.0 | 0.104341 | 0.052813 | 0.019380 | 0.064920 | 0.092630 | 0.130400 | 0.34540 |
| concavity_mean | 569.0 | 0.088799 | 0.079720 | 0.000000 | 0.029560 | 0.061540 | 0.130700 | 0.42680 |
| concave points_mean | 569.0 | 0.048919 | 0.038803 | 0.000000 | 0.020310 | 0.033500 | 0.074000 | 0.20120 |
| symmetry_mean | 569.0 | 0.181162 | 0.027414 | 0.106000 | 0.161900 | 0.179200 | 0.195700 | 0.30400 |
| fractal_dimension_mean | 569.0 | 0.062798 | 0.007060 | 0.049960 | 0.057700 | 0.061540 | 0.066120 | 0.09744 |
| radius_se | 569.0 | 0.405172 | 0.277313 | 0.111500 | 0.232400 | 0.324200 | 0.478900 | 2.87300 |
| texture_se | 569.0 | 1.216853 | 0.551648 | 0.360200 | 0.833900 | 1.108000 | 1.474000 | 4.88500 |
| perimeter_se | 569.0 | 2.866059 | 2.021855 | 0.757000 | 1.606000 | 2.287000 | 3.357000 | 21.98000 |
| area_se | 569.0 | 40.337079 | 45.491006 | 6.802000 | 17.850000 | 24.530000 | 45.190000 | 542.20000 |
| smoothness_se | 569.0 | 0.007041 | 0.003003 | 0.001713 | 0.005169 | 0.006380 | 0.008146 | 0.03113 |
| compactness_se | 569.0 | 0.025478 | 0.017908 | 0.002252 | 0.013080 | 0.020450 | 0.032450 | 0.13540 |
| concavity_se | 569.0 | 0.031894 | 0.030186 | 0.000000 | 0.015090 | 0.025890 | 0.042050 | 0.39600 |
| concave points_se | 569.0 | 0.011796 | 0.006170 | 0.000000 | 0.007638 | 0.010930 | 0.014710 | 0.05279 |
| symmetry_se | 569.0 | 0.020542 | 0.008266 | 0.007882 | 0.015160 | 0.018730 | 0.023480 | 0.07895 |
| fractal_dimension_se | 569.0 | 0.003795 | 0.002646 | 0.000895 | 0.002248 | 0.003187 | 0.004558 | 0.02984 |
| radius_worst | 569.0 | 16.269190 | 4.833242 | 7.930000 | 13.010000 | 14.970000 | 18.790000 | 36.04000 |
| texture_worst | 569.0 | 25.677223 | 6.146258 | 12.020000 | 21.080000 | 25.410000 | 29.720000 | 49.54000 |
| perimeter_worst | 569.0 | 107.261213 | 33.602542 | 50.410000 | 84.110000 | 97.660000 | 125.400000 | 251.20000 |
| area_worst | 569.0 | 880.583128 | 569.356993 | 185.200000 | 515.300000 | 686.500000 | 1084.000000 | 4254.00000 |
| smoothness_worst | 569.0 | 0.132369 | 0.022832 | 0.071170 | 0.116600 | 0.131300 | 0.146000 | 0.22260 |
| compactness_worst | 569.0 | 0.254265 | 0.157336 | 0.027290 | 0.147200 | 0.211900 | 0.339100 | 1.05800 |
| concavity_worst | 569.0 | 0.272188 | 0.208624 | 0.000000 | 0.114500 | 0.226700 | 0.382900 | 1.25200 |
| concave points_worst | 569.0 | 0.114606 | 0.065732 | 0.000000 | 0.064930 | 0.099930 | 0.161400 | 0.29100 |
| symmetry_worst | 569.0 | 0.290076 | 0.061867 | 0.156500 | 0.250400 | 0.282200 | 0.317900 | 0.66380 |
| fractal_dimension_worst | 569.0 | 0.083946 | 0.018061 | 0.055040 | 0.071460 | 0.080040 | 0.092080 | 0.20750 |
In [19]:
# Reduce the number of decimals displayed
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df.describe().T
Out[19]:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| diagnosis | 569.00 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
| radius_mean | 569.00 | 14.13 | 3.52 | 6.98 | 11.70 | 13.37 | 15.78 | 28.11 |
| texture_mean | 569.00 | 19.29 | 4.30 | 9.71 | 16.17 | 18.84 | 21.80 | 39.28 |
| perimeter_mean | 569.00 | 91.97 | 24.30 | 43.79 | 75.17 | 86.24 | 104.10 | 188.50 |
| area_mean | 569.00 | 654.89 | 351.91 | 143.50 | 420.30 | 551.10 | 782.70 | 2501.00 |
| smoothness_mean | 569.00 | 0.10 | 0.01 | 0.05 | 0.09 | 0.10 | 0.11 | 0.16 |
| compactness_mean | 569.00 | 0.10 | 0.05 | 0.02 | 0.06 | 0.09 | 0.13 | 0.35 |
| concavity_mean | 569.00 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 |
| concave points_mean | 569.00 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 |
| symmetry_mean | 569.00 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 |
| fractal_dimension_mean | 569.00 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 |
| radius_se | 569.00 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 |
| texture_se | 569.00 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 |
| perimeter_se | 569.00 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 |
| area_se | 569.00 | 40.34 | 45.49 | 6.80 | 17.85 | 24.53 | 45.19 | 542.20 |
| smoothness_se | 569.00 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 |
| compactness_se | 569.00 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 |
| concavity_se | 569.00 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 |
| concave points_se | 569.00 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 |
| symmetry_se | 569.00 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 |
| fractal_dimension_se | 569.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 |
| radius_worst | 569.00 | 16.27 | 4.83 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 |
| texture_worst | 569.00 | 25.68 | 6.15 | 12.02 | 21.08 | 25.41 | 29.72 | 49.54 |
| perimeter_worst | 569.00 | 107.26 | 33.60 | 50.41 | 84.11 | 97.66 | 125.40 | 251.20 |
| area_worst | 569.00 | 880.58 | 569.36 | 185.20 | 515.30 | 686.50 | 1084.00 | 4254.00 |
| smoothness_worst | 569.00 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 |
| compactness_worst | 569.00 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 |
| concavity_worst | 569.00 | 0.27 | 0.21 | 0.00 | 0.11 | 0.23 | 0.38 | 1.25 |
| concave points_worst | 569.00 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 |
| symmetry_worst | 569.00 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 |
| fractal_dimension_worst | 569.00 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 |
In [20]:
df.diagnosis.value_counts()
Out[20]:
diagnosis
0    357
1    212
Name: count, dtype: int64
In [21]:
df["diagnosis"].value_counts()
Out[21]:
diagnosis
0    357
1    212
Name: count, dtype: int64
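In [21a]:
The counts above can also be expressed as class proportions, which makes the benign/malignant imbalance explicit. A small sketch reproducing the reported counts with a stand-in Series (357 benign, 212 malignant):

```python
import pandas as pd

# Stand-in for the encoded target column: 357 benign (0), 212 malignant (1)
diagnosis = pd.Series([0] * 357 + [1] * 212)

# normalize=True returns proportions instead of raw counts
balance = diagnosis.value_counts(normalize=True).round(3)
print(balance.to_dict())  # {0: 0.627, 1: 0.373}
```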
In [22]:
# Display a random sample of 15 rows from the DataFrame
# (calling np.random.seed() with no argument reseeds from system entropy, so the sample differs on each run)
np.random.seed()
df.sample(n=15)
Out[22]:
| diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150 | 0 | 13.00 | 20.78 | 83.51 | 519.40 | 0.11 | 0.08 | 0.03 | 0.03 | 0.25 | ... | 14.16 | 24.11 | 90.82 | 616.70 | 0.13 | 0.11 | 0.08 | 0.06 | 0.32 | 0.06 |
| 515 | 0 | 11.34 | 18.61 | 72.76 | 391.20 | 0.10 | 0.08 | 0.04 | 0.03 | 0.19 | ... | 12.47 | 23.03 | 79.15 | 478.60 | 0.15 | 0.16 | 0.16 | 0.09 | 0.31 | 0.07 |
| 470 | 0 | 9.67 | 18.49 | 61.49 | 289.10 | 0.09 | 0.06 | 0.03 | 0.02 | 0.22 | ... | 11.14 | 25.62 | 70.88 | 385.20 | 0.12 | 0.15 | 0.13 | 0.07 | 0.32 | 0.09 |
| 31 | 1 | 11.84 | 18.70 | 77.93 | 440.60 | 0.11 | 0.15 | 0.12 | 0.05 | 0.23 | ... | 16.82 | 28.12 | 119.40 | 888.70 | 0.16 | 0.58 | 0.70 | 0.15 | 0.48 | 0.14 |
| 417 | 1 | 15.50 | 21.08 | 102.90 | 803.10 | 0.11 | 0.16 | 0.15 | 0.08 | 0.21 | ... | 23.17 | 27.65 | 157.10 | 1748.00 | 0.15 | 0.40 | 0.42 | 0.21 | 0.30 | 0.10 |
| 78 | 1 | 20.18 | 23.97 | 143.70 | 1245.00 | 0.13 | 0.35 | 0.38 | 0.16 | 0.29 | ... | 23.37 | 31.72 | 170.30 | 1623.00 | 0.16 | 0.62 | 0.77 | 0.25 | 0.54 | 0.10 |
| 195 | 0 | 12.91 | 16.33 | 82.53 | 516.40 | 0.08 | 0.05 | 0.04 | 0.02 | 0.18 | ... | 13.88 | 22.00 | 90.81 | 600.60 | 0.11 | 0.15 | 0.18 | 0.08 | 0.30 | 0.07 |
| 234 | 0 | 9.57 | 15.91 | 60.21 | 279.60 | 0.08 | 0.04 | 0.02 | 0.02 | 0.16 | ... | 10.51 | 19.16 | 65.74 | 335.90 | 0.15 | 0.10 | 0.07 | 0.07 | 0.28 | 0.08 |
| 506 | 0 | 12.22 | 20.04 | 79.47 | 453.10 | 0.11 | 0.12 | 0.08 | 0.02 | 0.21 | ... | 13.16 | 24.17 | 85.13 | 515.30 | 0.14 | 0.23 | 0.35 | 0.08 | 0.27 | 0.09 |
| 406 | 0 | 16.14 | 14.86 | 104.30 | 800.00 | 0.09 | 0.09 | 0.06 | 0.05 | 0.17 | ... | 17.71 | 19.58 | 115.90 | 947.90 | 0.12 | 0.17 | 0.23 | 0.11 | 0.28 | 0.07 |
| 40 | 1 | 13.44 | 21.58 | 86.18 | 563.00 | 0.08 | 0.06 | 0.03 | 0.02 | 0.18 | ... | 15.93 | 30.25 | 102.50 | 787.90 | 0.11 | 0.20 | 0.21 | 0.11 | 0.30 | 0.07 |
| 77 | 1 | 18.05 | 16.15 | 120.20 | 1006.00 | 0.11 | 0.21 | 0.17 | 0.11 | 0.22 | ... | 22.39 | 18.91 | 150.10 | 1610.00 | 0.15 | 0.56 | 0.38 | 0.21 | 0.38 | 0.11 |
| 4 | 1 | 20.29 | 14.34 | 135.10 | 1297.00 | 0.10 | 0.13 | 0.20 | 0.10 | 0.18 | ... | 22.54 | 16.67 | 152.20 | 1575.00 | 0.14 | 0.20 | 0.40 | 0.16 | 0.24 | 0.08 |
| 87 | 1 | 19.02 | 24.59 | 122.00 | 1076.00 | 0.09 | 0.12 | 0.15 | 0.08 | 0.20 | ... | 24.56 | 30.41 | 152.90 | 1623.00 | 0.12 | 0.32 | 0.58 | 0.20 | 0.40 | 0.09 |
| 360 | 0 | 12.54 | 18.07 | 79.42 | 491.90 | 0.07 | 0.03 | 0.00 | 0.01 | 0.15 | ... | 13.72 | 20.98 | 86.82 | 585.70 | 0.09 | 0.04 | 0.00 | 0.02 | 0.22 | 0.06 |
15 rows × 31 columns
In [23]:
# Checking for duplicated data
df[df.duplicated()].count()
Out[23]:
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
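In [23a]:
An equivalent and more compact duplicate check is `df.duplicated().sum()`, which returns a single scalar count of duplicate rows rather than a per-column breakdown. A sketch on a tiny hypothetical frame with one exact duplicate:

```python
import pandas as pd

# Small hypothetical frame containing one exact duplicate row
toy = pd.DataFrame({'a': [1, 2, 2], 'b': [3, 4, 4]})

# duplicated() marks repeats of earlier rows; summing the boolean mask
# counts them in one step
print(toy.duplicated().sum())  # 1
```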
Exploratory Data Analysis¶
In [24]:
# To explore relationships between variables in the data
# Pearson's Correlation
df.corr()
Out[24]:
| diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| diagnosis | 1.00 | 0.73 | 0.42 | 0.74 | 0.71 | 0.36 | 0.60 | 0.70 | 0.78 | 0.33 | ... | 0.78 | 0.46 | 0.78 | 0.73 | 0.42 | 0.59 | 0.66 | 0.79 | 0.42 | 0.32 |
| radius_mean | 0.73 | 1.00 | 0.32 | 1.00 | 0.99 | 0.17 | 0.51 | 0.68 | 0.82 | 0.15 | ... | 0.97 | 0.30 | 0.97 | 0.94 | 0.12 | 0.41 | 0.53 | 0.74 | 0.16 | 0.01 |
| texture_mean | 0.42 | 0.32 | 1.00 | 0.33 | 0.32 | -0.02 | 0.24 | 0.30 | 0.29 | 0.07 | ... | 0.35 | 0.91 | 0.36 | 0.34 | 0.08 | 0.28 | 0.30 | 0.30 | 0.11 | 0.12 |
| perimeter_mean | 0.74 | 1.00 | 0.33 | 1.00 | 0.99 | 0.21 | 0.56 | 0.72 | 0.85 | 0.18 | ... | 0.97 | 0.30 | 0.97 | 0.94 | 0.15 | 0.46 | 0.56 | 0.77 | 0.19 | 0.05 |
| area_mean | 0.71 | 0.99 | 0.32 | 0.99 | 1.00 | 0.18 | 0.50 | 0.69 | 0.82 | 0.15 | ... | 0.96 | 0.29 | 0.96 | 0.96 | 0.12 | 0.39 | 0.51 | 0.72 | 0.14 | 0.00 |
| smoothness_mean | 0.36 | 0.17 | -0.02 | 0.21 | 0.18 | 1.00 | 0.66 | 0.52 | 0.55 | 0.56 | ... | 0.21 | 0.04 | 0.24 | 0.21 | 0.81 | 0.47 | 0.43 | 0.50 | 0.39 | 0.50 |
| compactness_mean | 0.60 | 0.51 | 0.24 | 0.56 | 0.50 | 0.66 | 1.00 | 0.88 | 0.83 | 0.60 | ... | 0.54 | 0.25 | 0.59 | 0.51 | 0.57 | 0.87 | 0.82 | 0.82 | 0.51 | 0.69 |
| concavity_mean | 0.70 | 0.68 | 0.30 | 0.72 | 0.69 | 0.52 | 0.88 | 1.00 | 0.92 | 0.50 | ... | 0.69 | 0.30 | 0.73 | 0.68 | 0.45 | 0.75 | 0.88 | 0.86 | 0.41 | 0.51 |
| concave points_mean | 0.78 | 0.82 | 0.29 | 0.85 | 0.82 | 0.55 | 0.83 | 0.92 | 1.00 | 0.46 | ... | 0.83 | 0.29 | 0.86 | 0.81 | 0.45 | 0.67 | 0.75 | 0.91 | 0.38 | 0.37 |
| symmetry_mean | 0.33 | 0.15 | 0.07 | 0.18 | 0.15 | 0.56 | 0.60 | 0.50 | 0.46 | 1.00 | ... | 0.19 | 0.09 | 0.22 | 0.18 | 0.43 | 0.47 | 0.43 | 0.43 | 0.70 | 0.44 |
| fractal_dimension_mean | -0.01 | -0.31 | -0.08 | -0.26 | -0.28 | 0.58 | 0.57 | 0.34 | 0.17 | 0.48 | ... | -0.25 | -0.05 | -0.21 | -0.23 | 0.50 | 0.46 | 0.35 | 0.18 | 0.33 | 0.77 |
| radius_se | 0.57 | 0.68 | 0.28 | 0.69 | 0.73 | 0.30 | 0.50 | 0.63 | 0.70 | 0.30 | ... | 0.72 | 0.19 | 0.72 | 0.75 | 0.14 | 0.29 | 0.38 | 0.53 | 0.09 | 0.05 |
| texture_se | -0.01 | -0.10 | 0.39 | -0.09 | -0.07 | 0.07 | 0.05 | 0.08 | 0.02 | 0.13 | ... | -0.11 | 0.41 | -0.10 | -0.08 | -0.07 | -0.09 | -0.07 | -0.12 | -0.13 | -0.05 |
| perimeter_se | 0.56 | 0.67 | 0.28 | 0.69 | 0.73 | 0.30 | 0.55 | 0.66 | 0.71 | 0.31 | ... | 0.70 | 0.20 | 0.72 | 0.73 | 0.13 | 0.34 | 0.42 | 0.55 | 0.11 | 0.09 |
| area_se | 0.55 | 0.74 | 0.26 | 0.74 | 0.80 | 0.25 | 0.46 | 0.62 | 0.69 | 0.22 | ... | 0.76 | 0.20 | 0.76 | 0.81 | 0.13 | 0.28 | 0.39 | 0.54 | 0.07 | 0.02 |
| smoothness_se | -0.07 | -0.22 | 0.01 | -0.20 | -0.17 | 0.33 | 0.14 | 0.10 | 0.03 | 0.19 | ... | -0.23 | -0.07 | -0.22 | -0.18 | 0.31 | -0.06 | -0.06 | -0.10 | -0.11 | 0.10 |
| compactness_se | 0.29 | 0.21 | 0.19 | 0.25 | 0.21 | 0.32 | 0.74 | 0.67 | 0.49 | 0.42 | ... | 0.20 | 0.14 | 0.26 | 0.20 | 0.23 | 0.68 | 0.64 | 0.48 | 0.28 | 0.59 |
| concavity_se | 0.25 | 0.19 | 0.14 | 0.23 | 0.21 | 0.25 | 0.57 | 0.69 | 0.44 | 0.34 | ... | 0.19 | 0.10 | 0.23 | 0.19 | 0.17 | 0.48 | 0.66 | 0.44 | 0.20 | 0.44 |
| concave points_se | 0.41 | 0.38 | 0.16 | 0.41 | 0.37 | 0.38 | 0.64 | 0.68 | 0.62 | 0.39 | ... | 0.36 | 0.09 | 0.39 | 0.34 | 0.22 | 0.45 | 0.55 | 0.60 | 0.14 | 0.31 |
| symmetry_se | -0.01 | -0.10 | 0.01 | -0.08 | -0.07 | 0.20 | 0.23 | 0.18 | 0.10 | 0.45 | ... | -0.13 | -0.08 | -0.10 | -0.11 | -0.01 | 0.06 | 0.04 | -0.03 | 0.39 | 0.08 |
| fractal_dimension_se | 0.08 | -0.04 | 0.05 | -0.01 | -0.02 | 0.28 | 0.51 | 0.45 | 0.26 | 0.33 | ... | -0.04 | -0.00 | -0.00 | -0.02 | 0.17 | 0.39 | 0.38 | 0.22 | 0.11 | 0.59 |
| radius_worst | 0.78 | 0.97 | 0.35 | 0.97 | 0.96 | 0.21 | 0.54 | 0.69 | 0.83 | 0.19 | ... | 1.00 | 0.36 | 0.99 | 0.98 | 0.22 | 0.48 | 0.57 | 0.79 | 0.24 | 0.09 |
| texture_worst | 0.46 | 0.30 | 0.91 | 0.30 | 0.29 | 0.04 | 0.25 | 0.30 | 0.29 | 0.09 | ... | 0.36 | 1.00 | 0.37 | 0.35 | 0.23 | 0.36 | 0.37 | 0.36 | 0.23 | 0.22 |
| perimeter_worst | 0.78 | 0.97 | 0.36 | 0.97 | 0.96 | 0.24 | 0.59 | 0.73 | 0.86 | 0.22 | ... | 0.99 | 0.37 | 1.00 | 0.98 | 0.24 | 0.53 | 0.62 | 0.82 | 0.27 | 0.14 |
| area_worst | 0.73 | 0.94 | 0.34 | 0.94 | 0.96 | 0.21 | 0.51 | 0.68 | 0.81 | 0.18 | ... | 0.98 | 0.35 | 0.98 | 1.00 | 0.21 | 0.44 | 0.54 | 0.75 | 0.21 | 0.08 |
| smoothness_worst | 0.42 | 0.12 | 0.08 | 0.15 | 0.12 | 0.81 | 0.57 | 0.45 | 0.45 | 0.43 | ... | 0.22 | 0.23 | 0.24 | 0.21 | 1.00 | 0.57 | 0.52 | 0.55 | 0.49 | 0.62 |
| compactness_worst | 0.59 | 0.41 | 0.28 | 0.46 | 0.39 | 0.47 | 0.87 | 0.75 | 0.67 | 0.47 | ... | 0.48 | 0.36 | 0.53 | 0.44 | 0.57 | 1.00 | 0.89 | 0.80 | 0.61 | 0.81 |
| concavity_worst | 0.66 | 0.53 | 0.30 | 0.56 | 0.51 | 0.43 | 0.82 | 0.88 | 0.75 | 0.43 | ... | 0.57 | 0.37 | 0.62 | 0.54 | 0.52 | 0.89 | 1.00 | 0.86 | 0.53 | 0.69 |
| concave points_worst | 0.79 | 0.74 | 0.30 | 0.77 | 0.72 | 0.50 | 0.82 | 0.86 | 0.91 | 0.43 | ... | 0.79 | 0.36 | 0.82 | 0.75 | 0.55 | 0.80 | 0.86 | 1.00 | 0.50 | 0.51 |
| symmetry_worst | 0.42 | 0.16 | 0.11 | 0.19 | 0.14 | 0.39 | 0.51 | 0.41 | 0.38 | 0.70 | ... | 0.24 | 0.23 | 0.27 | 0.21 | 0.49 | 0.61 | 0.53 | 0.50 | 1.00 | 0.54 |
| fractal_dimension_worst | 0.32 | 0.01 | 0.12 | 0.05 | 0.00 | 0.50 | 0.69 | 0.51 | 0.37 | 0.44 | ... | 0.09 | 0.22 | 0.14 | 0.08 | 0.62 | 0.81 | 0.69 | 0.51 | 0.54 | 1.00 |
31 rows × 31 columns
In [25]:
# Pearson's Correlation Heatmap
plt.figure(figsize=(50,35))
sns.set(font_scale= 1.8)
plt.rcParams["axes.labelsize"] = 10
sns.heatmap(df.corr(), annot=True);
plt.show();
In [26]:
# Data Cleaning
vars_to_remove = [
    # Features that were highly correlated (pairwise r > 0.95) are removed
    'perimeter_mean', 'area_mean', 'perimeter_worst', 'area_worst', 'radius_worst', 'radius_se'
]
df = df.drop(columns=vars_to_remove)
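The drop list above was chosen by reading the heatmap; as a sketch, the same selection could be automated. The helper name `high_corr_pairs` and the three-column demo frame are made up for illustration, not part of the project data.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.95):
    """Hypothetical helper: list column pairs whose absolute Pearson
    correlation exceeds a threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once (no self-pairs)
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(float(upper.loc[a, b]), 2))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]

demo = pd.DataFrame({
    'radius': [1.0, 2.0, 3.0, 4.0],
    'perimeter': [2.1, 4.0, 6.2, 7.9],   # nearly proportional to radius
    'texture': [5.0, 1.0, 4.0, 2.0],     # unrelated
})
print(high_corr_pairs(demo))
```

On the real dataframe, the returned pairs would correspond to the features dropped above.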
In [27]:
# Creates a histogram using the values from the "radius_mean" column of the DataFrame
plt.hist(df["radius_mean"], color = 'b');
plt.show();
In [28]:
# Creates a distribution plot of the "radius_mean" column from the DataFrame
# (sns.distplot is deprecated in recent seaborn; sns.histplot(df["radius_mean"], kde=True) is the modern equivalent)
sns.distplot(df["radius_mean"], color='r', rug=True);
plt.show();
In [29]:
# Pearson's Correlation Heatmap
plt.figure(figsize=(50,35))
sns.set(font_scale= 1.8)
plt.rcParams["axes.labelsize"] = 10
sns.heatmap(df.corr(), annot=True, cmap = "coolwarm");
plt.show();
In [30]:
# Spearman's Rank or Spearman's Rho correlation
plt.figure(figsize=(50,35))
sns.set(font_scale= 1.8)
plt.rcParams["axes.labelsize"] = 10
sns.heatmap(df.corr(method='spearman'), annot=True, cmap="coolwarm"); # nonparametric correlation
plt.show()
In [31]:
# Phi K correlations for all variables
! pip install phik
import phik
from phik import resources, report
Requirement already satisfied: phik in /opt/anaconda3/lib/python3.12/site-packages (0.12.4) Requirement already satisfied: numpy>=1.18.0 in /opt/anaconda3/lib/python3.12/site-packages (from phik) (1.26.4) Requirement already satisfied: scipy>=1.5.2 in /opt/anaconda3/lib/python3.12/site-packages (from phik) (1.13.1) Requirement already satisfied: pandas>=0.25.1 in /opt/anaconda3/lib/python3.12/site-packages (from phik) (2.2.2) Requirement already satisfied: matplotlib>=2.2.3 in /opt/anaconda3/lib/python3.12/site-packages (from phik) (3.9.2) Requirement already satisfied: joblib>=0.14.1 in /opt/anaconda3/lib/python3.12/site-packages (from phik) (1.4.2) Requirement already satisfied: contourpy>=1.0.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib>=2.2.3->phik) (1.2.0) Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib>=2.2.3->phik) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib>=2.2.3->phik) (4.51.0) Requirement already satisfied: kiwisolver>=1.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib>=2.2.3->phik) (1.4.4) Requirement already satisfied: packaging>=20.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib>=2.2.3->phik) (24.1) Requirement already satisfied: pillow>=8 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib>=2.2.3->phik) (10.4.0) Requirement already satisfied: pyparsing>=2.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib>=2.2.3->phik) (3.1.2) Requirement already satisfied: python-dateutil>=2.7 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib>=2.2.3->phik) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/anaconda3/lib/python3.12/site-packages (from pandas>=0.25.1->phik) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /opt/anaconda3/lib/python3.12/site-packages (from pandas>=0.25.1->phik) (2023.3) 
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.7->matplotlib>=2.2.3->phik) (1.16.0)
In [32]:
df.phik_matrix()
interval columns not set, guessing: ['diagnosis', 'radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'texture_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
Out[32]:
| diagnosis | radius_mean | texture_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | texture_se | ... | concave points_se | symmetry_se | fractal_dimension_se | texture_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| diagnosis | 1.00 | 0.93 | 0.61 | 0.47 | 0.78 | 0.92 | 0.96 | 0.43 | 0.21 | 0.10 | ... | 0.46 | 0.21 | 0.26 | 0.61 | 0.55 | 0.76 | 0.90 | 0.97 | 0.53 | 0.32 |
| radius_mean | 0.93 | 1.00 | 0.30 | 0.41 | 0.59 | 0.74 | 0.85 | 0.32 | 0.47 | 0.36 | ... | 0.35 | 0.36 | 0.21 | 0.30 | 0.23 | 0.52 | 0.61 | 0.76 | 0.16 | 0.19 |
| texture_mean | 0.61 | 0.30 | 1.00 | 0.00 | 0.32 | 0.40 | 0.36 | 0.16 | 0.00 | 0.36 | ... | 0.16 | 0.00 | 0.00 | 0.93 | 0.20 | 0.42 | 0.48 | 0.41 | 0.00 | 0.00 |
| smoothness_mean | 0.47 | 0.41 | 0.00 | 1.00 | 0.78 | 0.60 | 0.60 | 0.65 | 0.72 | 0.50 | ... | 0.37 | 0.44 | 0.34 | 0.00 | 0.82 | 0.53 | 0.46 | 0.53 | 0.68 | 0.57 |
| compactness_mean | 0.78 | 0.59 | 0.32 | 0.78 | 1.00 | 0.88 | 0.82 | 0.78 | 0.66 | 0.38 | ... | 0.59 | 0.69 | 0.48 | 0.18 | 0.67 | 0.86 | 0.86 | 0.83 | 0.73 | 0.62 |
| concavity_mean | 0.92 | 0.74 | 0.40 | 0.60 | 0.88 | 1.00 | 0.92 | 0.65 | 0.52 | 0.42 | ... | 0.63 | 0.45 | 0.53 | 0.36 | 0.50 | 0.76 | 0.86 | 0.84 | 0.53 | 0.45 |
| concave points_mean | 0.96 | 0.85 | 0.36 | 0.60 | 0.82 | 0.92 | 1.00 | 0.53 | 0.29 | 0.32 | ... | 0.57 | 0.28 | 0.26 | 0.35 | 0.47 | 0.69 | 0.77 | 0.90 | 0.50 | 0.34 |
| symmetry_mean | 0.43 | 0.32 | 0.16 | 0.65 | 0.78 | 0.65 | 0.53 | 1.00 | 0.59 | 0.45 | ... | 0.43 | 0.61 | 0.37 | 0.17 | 0.50 | 0.55 | 0.50 | 0.54 | 0.82 | 0.41 |
| fractal_dimension_mean | 0.21 | 0.47 | 0.00 | 0.72 | 0.66 | 0.52 | 0.29 | 0.59 | 1.00 | 0.15 | ... | 0.46 | 0.44 | 0.71 | 0.19 | 0.67 | 0.57 | 0.56 | 0.22 | 0.63 | 0.68 |
| texture_se | 0.10 | 0.36 | 0.36 | 0.50 | 0.38 | 0.42 | 0.32 | 0.45 | 0.15 | 1.00 | ... | 0.57 | 0.65 | 0.43 | 0.38 | 0.35 | 0.00 | 0.00 | 0.18 | 0.20 | 0.00 |
| perimeter_se | 0.58 | 0.71 | 0.27 | 0.30 | 0.54 | 0.67 | 0.71 | 0.39 | 0.20 | 0.34 | ... | 0.52 | 0.44 | 0.28 | 0.22 | 0.08 | 0.32 | 0.40 | 0.52 | 0.17 | 0.10 |
| area_se | 0.79 | 0.79 | 0.31 | 0.44 | 0.50 | 0.67 | 0.69 | 0.37 | 0.15 | 0.55 | ... | 0.43 | 0.38 | 0.20 | 0.33 | 0.09 | 0.29 | 0.40 | 0.52 | 0.13 | 0.00 |
| smoothness_se | 0.10 | 0.47 | 0.00 | 0.52 | 0.36 | 0.41 | 0.31 | 0.44 | 0.52 | 0.82 | ... | 0.69 | 0.69 | 0.58 | 0.00 | 0.53 | 0.00 | 0.00 | 0.19 | 0.00 | 0.17 |
| compactness_se | 0.34 | 0.28 | 0.19 | 0.41 | 0.64 | 0.65 | 0.45 | 0.44 | 0.55 | 0.55 | ... | 0.83 | 0.64 | 0.74 | 0.20 | 0.32 | 0.66 | 0.63 | 0.41 | 0.51 | 0.73 |
| concavity_se | 0.40 | 0.32 | 0.07 | 0.27 | 0.49 | 0.68 | 0.43 | 0.42 | 0.57 | 0.47 | ... | 0.85 | 0.44 | 0.81 | 0.00 | 0.07 | 0.37 | 0.69 | 0.34 | 0.29 | 0.40 |
| concave points_se | 0.46 | 0.35 | 0.16 | 0.37 | 0.59 | 0.63 | 0.57 | 0.43 | 0.46 | 0.57 | ... | 1.00 | 0.52 | 0.78 | 0.12 | 0.26 | 0.41 | 0.56 | 0.61 | 0.17 | 0.36 |
| symmetry_se | 0.21 | 0.36 | 0.00 | 0.44 | 0.69 | 0.45 | 0.28 | 0.61 | 0.44 | 0.65 | ... | 0.52 | 1.00 | 0.42 | 0.00 | 0.32 | 0.37 | 0.36 | 0.25 | 0.68 | 0.66 |
| fractal_dimension_se | 0.26 | 0.21 | 0.00 | 0.34 | 0.48 | 0.53 | 0.26 | 0.37 | 0.71 | 0.43 | ... | 0.78 | 0.42 | 1.00 | 0.04 | 0.27 | 0.39 | 0.46 | 0.20 | 0.16 | 0.53 |
| texture_worst | 0.61 | 0.30 | 0.93 | 0.00 | 0.18 | 0.36 | 0.35 | 0.17 | 0.19 | 0.38 | ... | 0.12 | 0.00 | 0.04 | 1.00 | 0.31 | 0.53 | 0.57 | 0.41 | 0.21 | 0.27 |
| smoothness_worst | 0.55 | 0.23 | 0.20 | 0.82 | 0.67 | 0.50 | 0.47 | 0.50 | 0.67 | 0.35 | ... | 0.26 | 0.32 | 0.27 | 0.31 | 1.00 | 0.69 | 0.57 | 0.59 | 0.69 | 0.63 |
| compactness_worst | 0.76 | 0.52 | 0.42 | 0.53 | 0.86 | 0.76 | 0.69 | 0.55 | 0.57 | 0.00 | ... | 0.41 | 0.37 | 0.39 | 0.53 | 0.69 | 1.00 | 0.93 | 0.79 | 0.75 | 0.80 |
| concavity_worst | 0.90 | 0.61 | 0.48 | 0.46 | 0.86 | 0.86 | 0.77 | 0.50 | 0.56 | 0.00 | ... | 0.56 | 0.36 | 0.46 | 0.57 | 0.57 | 0.93 | 1.00 | 0.83 | 0.66 | 0.74 |
| concave points_worst | 0.97 | 0.76 | 0.41 | 0.53 | 0.83 | 0.84 | 0.90 | 0.54 | 0.22 | 0.18 | ... | 0.61 | 0.25 | 0.20 | 0.41 | 0.59 | 0.79 | 0.83 | 1.00 | 0.53 | 0.41 |
| symmetry_worst | 0.53 | 0.16 | 0.00 | 0.68 | 0.73 | 0.53 | 0.50 | 0.82 | 0.63 | 0.20 | ... | 0.17 | 0.68 | 0.16 | 0.21 | 0.69 | 0.75 | 0.66 | 0.53 | 1.00 | 0.71 |
| fractal_dimension_worst | 0.32 | 0.19 | 0.00 | 0.57 | 0.62 | 0.45 | 0.34 | 0.41 | 0.68 | 0.00 | ... | 0.36 | 0.66 | 0.53 | 0.27 | 0.63 | 0.80 | 0.74 | 0.41 | 0.71 | 1.00 |
25 rows × 25 columns
In [33]:
from phik.report import plot_correlation_matrix
phik_overview = df.phik_matrix()
interval columns not set, guessing: ['diagnosis', 'radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'texture_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
In [34]:
# Presenting Phi K data as a heatmap (credit: Jai Gupta, Stanford SPCS 2023)
plot_correlation_matrix(phik_overview.values,
x_labels=phik_overview.columns,
y_labels=phik_overview.index,
vmin=0, vmax=1, color_map="coolwarm",
title=r"correlation $\phi_K$",
fontsize_factor=1.8,
figsize=(40, 32))
plt.tight_layout()
plt.show()
In [35]:
# Creates a violin plot using the Seaborn library
sns.violinplot(df['concave points_worst'],color='y');
plt.show()
In [36]:
sns.violinplot(x=df['concave points_worst'],color='y'); # Note inclusion of "x=" to rotate the plot
plt.show()
In [37]:
# Function to create plot for categorical variable
# Annotate barplot
def annotate_bars(ax, feature):
    total = len(feature)
    for p in ax.patches:
        percentage = '{:.2f}%'.format(100 * p.get_height() / total)
        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        ax.annotate(percentage, (x, y), ha='center', va='bottom', size=12)

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
sns.countplot(data=df, x='diagnosis', palette='winter', ax=ax)
annotate_bars(ax, df['diagnosis'])
plt.tight_layout()
plt.show()
In [38]:
# From UT Austin Computer Science Department
# Used this function to create a combo boxplot and histogram for continuous (interval/ratio --> int64 and float64) variables
from scipy.stats import norm  # needed for the fit=norm option below

def boxplot_histogram(feature, figsize=(10,7), bins=None):
    sns.set(font_scale=2)
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2,
                                           sharex=True,
                                           gridspec_kw={"height_ratios": (.25, .75)},
                                           figsize=figsize)
    sns.boxplot(feature, ax=ax_box2, orient="h", showmeans=True, color='red')  # mean value will be noted
    if bins:
        sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins)
    else:
        sns.distplot(feature, kde=False, ax=ax_hist2, fit=norm)
    ax_hist2.axvline(np.mean(feature), color='g', linestyle='--')       # Add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-')  # Add median to the histogram
    plt.axvline(feature.mode()[0], color='r', linestyle='dashed', linewidth=1)  # Add mode to the histogram
In [39]:
boxplot_histogram(df.radius_mean)
plt.show()
In [40]:
boxplot_histogram(df.concavity_mean)
plt.show()
In [41]:
boxplot_histogram(df["concave points_worst"])
plt.show()
In [42]:
# Plot histograms to check the distribution of each numeric variable
from scipy.stats import norm
all_col = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(17,75))
for i in range(len(all_col)):
    plt.subplot(18,3,i+1)
    plt.hist(df[all_col[i]])
    plt.tight_layout()
    plt.title(all_col[i], fontsize=25)
plt.show()
In [43]:
# Outlier detection using boxplots (for all I/R variables)
plt.figure(figsize=(20,30))
for i, variable in enumerate(df):  # enumerate is a built-in function that tracks the iteration count while looping
    plt.subplot(8,4,i+1)  # provides a way to plot multiple plots on a single figure
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
In [44]:
# Use flooring and capping method
def treat_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5*IQR
    Upper_Whisker = Q3 + 1.5*IQR
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df

def treat_outliers_all(df, col_list):
    for c in col_list:
        df = treat_outliers(df, c)
    return df

numerical_col = df.select_dtypes(include=np.number).columns.tolist()
df = treat_outliers_all(df, numerical_col)
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
    plt.subplot(8,4,i+1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
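The whisker arithmetic inside `treat_outliers` can be checked by hand on a tiny made-up series: values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR are clipped to the whiskers.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 2.0 and 3.5
iqr = q3 - q1                                # 1.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # -0.25 and 5.75
clipped = s.clip(lower, upper)               # 100.0 is capped at 5.75
print(lower, upper, clipped.max())
```

After capping, the outlier is replaced by the upper whisker, which is why the boxplots above show no points beyond the whiskers.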
Bivariate Analyses¶
In [45]:
# Boxplot with DV and 1 IV
# sns.boxplot(x = "categorical_var", y = "numeric_var", data = df)
# plt.title('graph_title')
# plt.show()
plt.figure(figsize=(10,7))
sns.boxplot(x = "diagnosis", y = "concave points_worst", data = df)
plt.title('Boxplot for diagnosis vs. concave points_worst')
plt.show()
In [46]:
ttest_boxplot = df.boxplot(column='radius_mean', by='diagnosis', figsize=(10, 6), grid=False);
ttest_boxplot.set_title('');
ttest_boxplot.set_ylabel('');
plt.show()
In [47]:
# Catplot
sns.catplot(x="diagnosis", y="concave points_worst", data=df, kind='boxen', height=6, aspect=1.6);
plt.xlabel('diagnosis', fontsize=15);
plt.title('Catplot for concave points_worst vs. diagnosis')
plt.ylabel('concave points_worst', fontsize=15);
plt.show()
In [48]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='smoothness_mean', y='radius_mean', hue='diagnosis', data=df, palette='Set2')
plt.show()
In [49]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='smoothness_mean', y='concave points_worst', hue='diagnosis', data=df, palette='Set2')
plt.show()
In [50]:
# View Outcome column
df.diagnosis
Out[50]:
0 1
1 1
2 1
3 1
4 1
..
564 1
565 1
566 1
567 1
568 0
Name: diagnosis, Length: 569, dtype: int64
In [51]:
# Map 0's and 1's to 'benign' and 'malignant', respectively
df['diagnosis'] = df['diagnosis'].map({0:'benign', 1:'malignant'})
In [52]:
# After running the code above, 0's become 'benign' and 1's become 'malignant'
df.diagnosis
Out[52]:
0 malignant
1 malignant
2 malignant
3 malignant
4 malignant
...
564 malignant
565 malignant
566 malignant
567 malignant
568 benign
Name: diagnosis, Length: 569, dtype: object
In [53]:
# Now notice with this stripplot that the x-axis has words instead of 0 and 1
plt.figure(figsize=(10,6))
sns.stripplot(data=df, x='diagnosis', y='area_se', jitter=True);
plt.show();
In [54]:
# To map values back into numbers
df['diagnosis'] = df['diagnosis'].map({'benign':0, 'malignant':1})
In [55]:
# Rerun the stripplot to see the x-axis labels are now 0 and 1 again
plt.figure(figsize=(10,6))
sns.stripplot(data=df, x='diagnosis', y='concavity_mean', jitter=True);
plt.show()
In [56]:
# Create swarm plot
plt.figure(figsize=(10,6))
sns.swarmplot(data=df, x='diagnosis', y='radius_mean');
plt.show();
One-Hot Encoding¶
In [57]:
# Converting categorical variables into binary vectors, where each category becomes a new feature with values of 0 or 1
# Let's create a fake dataframe simply looking at colors:
df_example = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})
df_example
Out[57]:
| Color | |
|---|---|
| 0 | Red |
| 1 | Blue |
| 2 | Green |
| 3 | Red |
| 4 | Blue |
In [58]:
# Import OneHot encoder
from sklearn.preprocessing import OneHotEncoder
# Initialize One-HotEncoder
encoder = OneHotEncoder()
# Perform encoding
encoded_df = encoder.fit_transform(df_example[['Color']])
# Convert encoded data to a pandas DataFrame
encoded_df = pd.DataFrame(encoded_df.toarray(), columns=encoder.get_feature_names_out(['Color']))
# Concatenate the original data with the encoded data
data_encoded = pd.concat([df_example, encoded_df], axis=1)
print(data_encoded)
Color Color_Blue Color_Green Color_Red 0 Red 0.00 0.00 1.00 1 Blue 1.00 0.00 0.00 2 Green 0.00 1.00 0.00 3 Red 0.00 0.00 1.00 4 Blue 1.00 0.00 0.00
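For completeness, `pd.get_dummies` produces an equivalent encoding without sklearn; a minimal sketch on the same toy colors:

```python
import pandas as pd

# One-hot encode with pandas alone: each category becomes a 0/1 column
df_example = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})
dummies = pd.get_dummies(df_example['Color'], prefix='Color', dtype=float)
print(sorted(dummies.columns))
```

`OneHotEncoder` is preferable inside an sklearn pipeline (it remembers the categories for transforming new data), while `get_dummies` is convenient for quick exploration.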
Supervised Machine Learning (Binary Classification)¶
In [59]:
# Make a copy of dataframe
data=df.copy()
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 569 entries, 0 to 568 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 diagnosis 569 non-null int64 1 radius_mean 569 non-null float64 2 texture_mean 569 non-null float64 3 smoothness_mean 569 non-null float64 4 compactness_mean 569 non-null float64 5 concavity_mean 569 non-null float64 6 concave points_mean 569 non-null float64 7 symmetry_mean 569 non-null float64 8 fractal_dimension_mean 569 non-null float64 9 texture_se 569 non-null float64 10 perimeter_se 569 non-null float64 11 area_se 569 non-null float64 12 smoothness_se 569 non-null float64 13 compactness_se 569 non-null float64 14 concavity_se 569 non-null float64 15 concave points_se 569 non-null float64 16 symmetry_se 569 non-null float64 17 fractal_dimension_se 569 non-null float64 18 texture_worst 569 non-null float64 19 smoothness_worst 569 non-null float64 20 compactness_worst 569 non-null float64 21 concavity_worst 569 non-null float64 22 concave points_worst 569 non-null float64 23 symmetry_worst 569 non-null float64 24 fractal_dimension_worst 569 non-null float64 dtypes: float64(24), int64(1) memory usage: 111.3 KB
In [60]:
# Libraries for different ML classifiers
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
# Libraries for model tuning and evaluation metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
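Since recall is used later as the grid-search scoring metric, here is a quick hand check of two of the metrics imported above on toy labels (recall = TP / (TP + FN), precision = TP / (TP + FP)):

```python
from sklearn import metrics

y_true = [1, 1, 1, 0, 0]  # three actual positives
y_pred = [1, 0, 1, 0, 1]  # TP=2, FN=1, FP=1
recall = metrics.recall_score(y_true, y_pred)      # 2 / (2 + 1)
precision = metrics.precision_score(y_true, y_pred)  # 2 / (2 + 1)
print(recall, precision)
```

For a cancer screen, recall is the natural target: a false negative (missed malignancy) is costlier than a false positive.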
Decision Tree¶
In [61]:
# Separate Outcome or Target variable from the predictors
X = data.drop('diagnosis', axis=1)  # Replace 'diagnosis' with your target variable name
y = data['diagnosis'].astype('int64')
# We used .astype('int64') above to convert the target to integers, since some functions might not work with bool type
In [62]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)
(398, 24) (171, 24)
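The split above does not stratify; with an imbalanced target (here roughly 63/37), passing `stratify=y` keeps the class ratio identical in the train and test splits. A sketch on toy data, not the breast-cancer frame:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)  # 30% positives

# stratify=y_toy preserves the 30% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy)
print(y_tr.mean(), y_te.mean())
```

Without stratification, the test-set class balance can drift from the training set's by chance, which slightly distorts recall estimates.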
Initial model using the Decision Tree Classifier¶
In [63]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='gini',class_weight={0:0.15,1:0.85},random_state=1)
# 2 commonly used splitting criteria are Gini impurity and information gain (entropy)
# Gini: measures the probability of misclassifying a randomly chosen element if it were randomly labeled
# Would the goal be to minimize or maximize the Gini impurity when making splits?
# MINIMIZE
# Information Gain (Entropy): entropy measures impurity or uncertainty, while information gain quantifies the reduction in entropy
# Which do we want to minimize? Maximize?
# MINIMIZE entropy
# MAXIMIZE information gain
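The Gini impurity described above, G = 1 - Σ p_k², can be computed by hand; a minimal sketch with a hypothetical `gini` helper:

```python
def gini(counts):
    """Gini impurity of a node given per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

pure = gini([10, 0])     # all one class -> 0.0, nothing to gain by splitting
balanced = gini([5, 5])  # 50/50 split -> 0.5, the two-class maximum
skewed = gini([9, 1])    # mostly one class -> 0.18
print(pure, balanced, round(skewed, 2))
```

The tree greedily picks the split that most reduces the (weighted) impurity of the child nodes, which is exactly the quantity Gini importance accumulates later.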
In [64]:
model.fit(X_train, y_train)
Out[64]:
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
In [65]:
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot = np.asarray(annot).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [66]:
make_confusion_matrix(model,y_test)
plt.show()
In [67]:
y_train.value_counts(1)
Out[67]:
diagnosis 0 0.63 1 0.37 Name: proportion, dtype: float64
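Note that `value_counts(1)` is shorthand for `value_counts(normalize=True)`: it returns class proportions rather than raw counts. A toy check:

```python
import pandas as pd

s = pd.Series([0, 0, 0, 1, 1])
props = s.value_counts(normalize=True)  # proportions, not counts
print(props.loc[0], props.loc[1])
```

This is how the 0.63 / 0.37 split above was produced, and it motivates the class weights passed to the classifier.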
In [68]:
column_names = list(data.columns)
column_names.remove('diagnosis') # As this is the DV
feature_names = column_names
print(feature_names)
['radius_mean', 'texture_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'texture_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
In [69]:
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
plt.figure(figsize=(20,30))
out = tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=None,)
# Code below will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
In [70]:
# Text report showing the rules of the decision tree
print(tree.export_text(model,feature_names=feature_names,show_weights=True))
|--- concave points_worst <= 0.11 | |--- radius_mean <= 15.44 | | |--- area_se <= 48.98 | | | |--- weights: [31.50, 0.00] class: 0 | | |--- area_se > 48.98 | | | |--- smoothness_worst <= 0.11 | | | | |--- weights: [0.30, 0.00] class: 0 | | | |--- smoothness_worst > 0.11 | | | | |--- weights: [0.00, 0.85] class: 1 | |--- radius_mean > 15.44 | | |--- concavity_mean <= 0.04 | | | |--- symmetry_se <= 0.01 | | | | |--- weights: [0.00, 0.85] class: 1 | | | |--- symmetry_se > 0.01 | | | | |--- weights: [0.00, 2.55] class: 1 | | |--- concavity_mean > 0.04 | | | |--- weights: [0.30, 0.00] class: 0 |--- concave points_worst > 0.11 | |--- concave points_mean <= 0.05 | | |--- compactness_se <= 0.02 | | | |--- texture_worst <= 19.91 | | | | |--- weights: [0.45, 0.00] class: 0 | | | |--- texture_worst > 19.91 | | | | |--- concavity_mean <= 0.05 | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- concavity_mean > 0.05 | | | | | |--- weights: [0.00, 5.10] class: 1 | | |--- compactness_se > 0.02 | | | |--- smoothness_worst <= 0.18 | | | | |--- weights: [2.70, 0.00] class: 0 | | | |--- smoothness_worst > 0.18 | | | | |--- weights: [0.00, 0.85] class: 1 | |--- concave points_mean > 0.05 | | |--- texture_mean <= 14.16 | | | |--- weights: [0.45, 0.00] class: 0 | | |--- texture_mean > 14.16 | | | |--- concavity_worst <= 0.22 | | | | |--- weights: [0.15, 0.00] class: 0 | | | |--- concavity_worst > 0.22 | | | | |--- radius_mean <= 10.41 | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- radius_mean > 10.41 | | | | | |--- area_se <= 13.47 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- area_se > 13.47 | | | | | | |--- texture_worst <= 18.35 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- texture_worst > 18.35 | | | | | | | |--- smoothness_worst <= 0.10 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- smoothness_worst > 0.10 | | | | | | | | |--- concave points_mean <= 0.06 | | | | | | | | | |--- radius_mean <= 13.14 | | 
| | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | |--- radius_mean > 13.14 | | | | | | | | | | |--- texture_mean <= 15.49 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- texture_mean > 15.49 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | |--- concave points_mean > 0.06 | | | | | | | | | |--- weights: [0.00, 107.10] class: 1
In [71]:
# Importance of features in the tree building (The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance)
print (pd.DataFrame(model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp concave points_worst 0.70 radius_mean 0.11 concave points_mean 0.04 smoothness_worst 0.04 compactness_se 0.03 area_se 0.03 texture_mean 0.02 texture_worst 0.02 concavity_mean 0.01 concavity_worst 0.01 symmetry_worst 0.00 perimeter_se 0.00 symmetry_se 0.00 smoothness_mean 0.00 compactness_worst 0.00 compactness_mean 0.00 concave points_se 0.00 fractal_dimension_se 0.00 symmetry_mean 0.00 concavity_se 0.00 smoothness_se 0.00 texture_se 0.00 fractal_dimension_mean 0.00 fractal_dimension_worst 0.00
In [72]:
# Example of Feature Importance
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(20,15))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
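The Gini importances plotted above are normalized, so they sum to 1 and the dominant feature absorbs most of the mass. A toy check on synthetic data (not the project data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X_toy = rng.rand(200, 3)
y_toy = (X_toy[:, 0] > 0.5).astype(int)  # only feature 0 determines the label

clf = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)
imp = clf.feature_importances_
print(imp.sum(), imp.argmax())
```

Because the label depends only on feature 0, the tree assigns it essentially all of the importance, mirroring how `concave points_worst` dominates the table above.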
In [73]:
from sklearn.model_selection import GridSearchCV
In [74]:
# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1,class_weight = {0:.15,1:.85})
# Random state = controls random shuffling and splitting
# Grid of parameters to choose from
parameters = {
    'max_depth': np.arange(15, 27),
    'criterion': ['entropy', 'gini'],
    'splitter': ['best', 'random'],
    'min_impurity_decrease': [0.0001, 0.001],
    'max_features': ['log2', 'sqrt']
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data
estimator.fit(X_train, y_train)
Out[74]:
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=15,
                       max_features='log2', min_impurity_decrease=0.001,
                       random_state=1)
In [75]:
make_confusion_matrix(estimator,y_test)
In [76]:
plt.figure(figsize=(15,10))
out = tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=None)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
In [77]:
print(tree.export_text(estimator,feature_names=feature_names,show_weights=False))
|--- concave points_worst <= 0.11 | |--- fractal_dimension_se <= 0.00 | | |--- concavity_worst <= 0.12 | | | |--- class: 0 | | |--- concavity_worst > 0.12 | | | |--- radius_mean <= 15.44 | | | | |--- class: 0 | | | |--- radius_mean > 15.44 | | | | |--- class: 1 | |--- fractal_dimension_se > 0.00 | | |--- compactness_se <= 0.01 | | | |--- compactness_mean <= 0.06 | | | | |--- class: 0 | | | |--- compactness_mean > 0.06 | | | | |--- smoothness_mean <= 0.10 | | | | | |--- fractal_dimension_se <= 0.00 | | | | | | |--- class: 0 | | | | | |--- fractal_dimension_se > 0.00 | | | | | | |--- class: 1 | | | | |--- smoothness_mean > 0.10 | | | | | |--- class: 0 | | |--- compactness_se > 0.01 | | | |--- class: 0 |--- concave points_worst > 0.11 | |--- texture_worst <= 18.39 | | |--- class: 0 | |--- texture_worst > 18.39 | | |--- radius_mean <= 12.65 | | | |--- concave points_se <= 0.01 | | | | |--- fractal_dimension_worst <= 0.11 | | | | | |--- class: 0 | | | | |--- fractal_dimension_worst > 0.11 | | | | | |--- class: 1 | | | |--- concave points_se > 0.01 | | | | |--- class: 0 | | |--- radius_mean > 12.65 | | | |--- concavity_worst <= 0.22 | | | | |--- class: 0 | | | |--- concavity_worst > 0.22 | | | | |--- concavity_mean <= 0.05 | | | | | |--- class: 0 | | | | |--- concavity_mean > 0.05 | | | | | |--- class: 1
In [78]:
# Gini importance of each feature in the fitted tree
print(pd.DataFrame(estimator.feature_importances_, columns=["Imp"],
                   index=X_train.columns).sort_values(by='Imp', ascending=False))
                          Imp
concave points_worst     0.74
radius_mean              0.06
concavity_worst          0.05
texture_worst            0.03
compactness_mean         0.03
fractal_dimension_se     0.02
fractal_dimension_worst  0.02
concave points_se        0.02
compactness_se           0.01
smoothness_mean          0.01
concavity_mean           0.01
symmetry_mean            0.00
symmetry_worst           0.00
compactness_worst        0.00
smoothness_worst         0.00
concave points_mean      0.00
fractal_dimension_mean   0.00
texture_se               0.00
concavity_se             0.00
texture_mean             0.00
smoothness_se            0.00
area_se                  0.00
perimeter_se             0.00
symmetry_se              0.00
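Gini importance can overstate features that the tree happens to split on first and understate correlated alternatives. A complementary check is permutation importance, which measures how much shuffling one feature degrades a fitted model's test score. The sketch below is self-contained: it uses sklearn's bundled copy of this same Wisconsin diagnostic dataset and its own split, so the numbers will not match the notebook's `X_test`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance
import pandas as pd

# Self-contained stand-in for the notebook's data and estimator
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)
clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)

# n_repeats shuffles each column several times to average out noise
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=1)
perm_imp = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm_imp.head(10))
```

Features with near-zero permutation importance can usually be dropped without hurting the model, even if their Gini importance is nonzero.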
In [79]:
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(20,15))
plt.title('Feature Importance')
plt.barh(range(len(indices)), importances[indices], color='green', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Decision Tree¶
In [80]:
dtree_estimator = DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, random_state=1)
dtree_estimator.fit(X_train, y_train)
Out[80]:
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, random_state=1)
In [81]:
# Function to calculate different metric scores - Accuracy, Recall, Precision, and F1
def get_metrics_score(model, flag=True):
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    # Recall minimizes false negatives
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    # Precision minimizes false positives
    train_f1 = metrics.f1_score(y_train, pred_train)
    test_f1 = metrics.f1_score(y_test, pred_test)
    # F1 Score balances precision and recall
    score_list.extend((train_acc, test_acc, train_recall, test_recall,
                       train_precision, test_precision, train_f1, test_f1))
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1 Score on training set : ", train_f1)
        print("F1 Score on test set : ", test_f1)
    return score_list  # returns the list with train and test scores
In [82]:
# Function to make confusion matrix
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    # Predict on the test set (y_actual is expected to be y_test)
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    # Create a labeled DataFrame for the confusion matrix
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])
    # Format counts and percentages for annotation
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    # Combine into annotation labels
    annot_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(2, 2)
    # Plot the heatmap
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot_labels, fmt='', cmap='Blues')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
In [83]:
# Calculate metrics for your model
get_metrics_score(dtree_estimator)
# Create confusion matrix for your model
make_confusion_matrix(dtree_estimator, y_test)
Accuracy on training set :  1.0
Accuracy on test set :  0.9239766081871345
Recall on training set :  1.0
Recall on test set :  0.8412698412698413
Precision on training set :  1.0
Precision on test set :  0.9464285714285714
F1 Score on training set :  1.0
F1 Score on test set :  0.8907563025210085
In [84]:
dtree_tuned = DecisionTreeClassifier(class_weight={0:0.35, 1:0.65}, random_state=1)
parameters = {
'max_depth': np.arange(2, 10),
'min_samples_leaf': [5, 7, 10, 15],
'max_leaf_nodes': [2, 3, 5, 10, 15],
'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1]
}
# Parameters above control the size and shape of the tree, preventing it from growing too large or fitting noise.
scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer, n_jobs=-1)
grid_obj.fit(X_train, y_train)
dtree_tuned = grid_obj.best_estimator_
dtree_tuned.fit(X_train, y_train)
Out[84]:
DecisionTreeClassifier(class_weight={0: 0.35, 1: 0.65}, max_depth=3,
                       max_leaf_nodes=5, min_impurity_decrease=0.01,
                       min_samples_leaf=7, random_state=1)
In [85]:
get_metrics_score(dtree_tuned)
make_confusion_matrix(dtree_tuned,y_test)
Accuracy on training set :  0.957286432160804
Accuracy on test set :  0.9181286549707602
Recall on training set :  0.959731543624161
Recall on test set :  0.8888888888888888
Precision on training set :  0.9285714285714286
Precision on test set :  0.8888888888888888
F1 Score on training set :  0.9438943894389439
F1 Score on test set :  0.8888888888888888
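After a grid search like the one above, it is worth looking past `best_estimator_`: `best_params_` shows which combination won, and `cv_results_` shows how close the runners-up were. A minimal self-contained sketch, using sklearn's bundled copy of this dataset with a small hypothetical grid (note sklearn encodes benign as 1, so the labels are flipped here to make malignant the positive class, as in the notebook):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, recall_score

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # sklearn's copy encodes benign=1; flip so malignant is the positive class

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {'max_depth': [2, 3, 4], 'min_samples_leaf': [5, 10]},  # illustrative grid
    scoring=make_scorer(recall_score),
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)           # winning combination
print(round(grid.best_score_, 3))  # its mean cross-validated recall

# cv_results_ holds every tried combination, ranked by CV score
results = pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score').head(3))
```

If the top few rows of `cv_results_` score nearly the same, the simpler (shallower) tree is usually the safer pick.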
Bagging Classifier¶
In [86]:
# Fit the model
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train,y_train)
# Calculate metrics
get_metrics_score(bagging_classifier)
# Create the confusion matrix
make_confusion_matrix(bagging_classifier,y_test)
Accuracy on training set :  0.9974874371859297
Accuracy on test set :  0.9590643274853801
Recall on training set :  0.9932885906040269
Recall on test set :  0.9206349206349206
Precision on training set :  1.0
Precision on test set :  0.9666666666666667
F1 Score on training set :  0.9966329966329966
F1 Score on test set :  0.943089430894309
Tuned Bagging Classifier¶
In [87]:
# Define base BaggingClassifier
bagging_tuned = BaggingClassifier(random_state=1)
# Hyperparameter grid
parameters = {
'max_samples': [0.7, 0.8, 0.9, 1],
'max_features': [0.7, 0.8, 0.9, 1],
'n_estimators': [10, 20, 30, 40, 50],
}
# Use recall as scoring metric
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Setup GridSearchCV with parallel jobs and 5-fold CV
grid_obj = GridSearchCV(bagging_tuned, parameters, scoring=recall_scorer, cv=5, n_jobs=-1)
# Fit grid search
grid_obj.fit(X_train, y_train)
# Best estimator from grid search
bagging_tuned = grid_obj.best_estimator_
# Fit the best model on full training data
bagging_tuned.fit(X_train, y_train)
Out[87]:
BaggingClassifier(max_features=0.7, max_samples=0.9, n_estimators=40,
                  random_state=1)
In [88]:
get_metrics_score(bagging_tuned)
make_confusion_matrix(bagging_tuned,y_test)
Accuracy on training set :  0.9974874371859297
Accuracy on test set :  0.9532163742690059
Recall on training set :  0.9932885906040269
Recall on test set :  0.9206349206349206
Precision on training set :  1.0
Precision on test set :  0.9508196721311475
F1 Score on training set :  0.9966329966329966
F1 Score on test set :  0.9354838709677419
Random Forest¶
In [89]:
# Fit the model
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)
# Calculate metrics
get_metrics_score(rf_estimator)
# Create the confusion matrix
make_confusion_matrix(rf_estimator,y_test)
Accuracy on training set :  1.0
Accuracy on test set :  0.9590643274853801
Recall on training set :  1.0
Recall on test set :  0.9206349206349206
Precision on training set :  1.0
Precision on test set :  0.9666666666666667
F1 Score on training set :  1.0
F1 Score on test set :  0.943089430894309
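Bagged ensembles like this offer a nearly free generalization estimate that needs no held-out set: each tree skips roughly a third of the rows in its bootstrap draw, and those out-of-bag (OOB) rows can score the forest. A self-contained sketch on sklearn's bundled copy of this dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each row using only the trees that never saw it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
rf.fit(X, y)
print(round(rf.oob_score_, 3))  # OOB accuracy, a built-in cross-validation stand-in
```

The OOB score typically lands close to the test accuracy reported above, which makes it a useful sanity check when tuning.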
In [90]:
%%time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import numpy as np
rf_tuned = RandomForestClassifier(class_weight={0:0.35,1:0.65}, random_state=1)
parameters = {
'max_depth': list(np.arange(3, 10, 1)),
'max_features': np.arange(0.6, 1.1, 0.1),
'min_samples_split': np.arange(2, 20, 5),
'n_estimators': np.arange(30, 160, 20),
'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1]
}
scorer = metrics.make_scorer(metrics.recall_score)
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj.fit(X_train, y_train)
rf_tuned = grid_obj.best_estimator_
rf_tuned.fit(X_train, y_train)
CPU times: user 7.78 s, sys: 991 ms, total: 8.77 s
Wall time: 3min 9s
Out[90]:
RandomForestClassifier(class_weight={0: 0.35, 1: 0.65}, max_depth=5,
                       max_features=0.6, min_impurity_decrease=0.01,
                       min_samples_split=12, n_estimators=90, random_state=1)
In [91]:
# Calculate metrics
get_metrics_score(rf_tuned)
# Create the confusion matrix
make_confusion_matrix(rf_tuned, y_test)
Accuracy on training set :  0.9698492462311558
Accuracy on test set :  0.9415204678362573
Recall on training set :  0.9731543624161074
Recall on test set :  0.9206349206349206
Precision on training set :  0.9477124183006536
Precision on test set :  0.9206349206349206
F1 Score on training set :  0.9602649006622517
F1 Score on test set :  0.9206349206349206
Comparing Supervised ML Classification Models¶
In [92]:
# Identify the models to compare
models = [dtree_estimator, dtree_tuned, bagging_classifier, bagging_tuned, rf_estimator, rf_tuned]
# Define empty lists to store results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
# Loop through all models to collect metrics (Accuracy, Recall, Precision, F1)
for model in models:
    scores = get_metrics_score(model, False)
    acc_train.append(scores[0])
    acc_test.append(scores[1])
    recall_train.append(scores[2])
    recall_test.append(scores[3])
    precision_train.append(scores[4])
    precision_test.append(scores[5])
    f1_train.append(scores[6])
    f1_test.append(scores[7])
In [93]:
# Compare models on evaluation metrics
comparison_frame = pd.DataFrame({
'Model': ['Decision Tree', 'Tuned Decision Tree', 'Bagging Classifier', 'Tuned Bagging Classifier', 'Random Forest', 'Tuned Random Forest'],
'Train_Accuracy': acc_train,
'Test_Accuracy': acc_test,
'Train_Recall': recall_train,
'Test_Recall': recall_test,
'Train_Precision': precision_train,
'Test_Precision': precision_test,
'Train_F1': f1_train,
'Test_F1': f1_test
})
# Sort models in decreasing order of most important metric
comparison_frame_sorted = comparison_frame.sort_values(by='Test_Recall', ascending=False)
# Set display options to avoid wrapping wide DataFrames
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', None)
# Print sorted DataFrame
print(comparison_frame_sorted)
                      Model  Train_Accuracy  Test_Accuracy  Train_Recall  Test_Recall  Train_Precision  Test_Precision  Train_F1  Test_F1
2        Bagging Classifier            1.00           0.96          0.99         0.92             1.00            0.97      1.00     0.94
3  Tuned Bagging Classifier            1.00           0.95          0.99         0.92             1.00            0.95      1.00     0.94
4             Random Forest            1.00           0.96          1.00         0.92             1.00            0.97      1.00     0.94
5       Tuned Random Forest            0.97           0.94          0.97         0.92             0.95            0.92      0.96     0.92
1       Tuned Decision Tree            0.96           0.92          0.96         0.89             0.93            0.89      0.94     0.89
0             Decision Tree            1.00           0.92          1.00         0.84             1.00            0.95      1.00     0.89
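The table above ranks models on one fixed train/test split, which can flatter or punish a model by luck. Cross-validated recall on the same folds is a steadier basis for the ranking. A self-contained sketch comparing three of the model families, using sklearn's bundled copy of this dataset with labels flipped so malignant is the positive class:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # malignant = 1, as in the notebook

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=1),
    'Bagging': BaggingClassifier(random_state=1),
    'Random Forest': RandomForestClassifier(random_state=1),
}
# Mean recall across 5 folds, all models scored on identical folds
cv_recall = {name: cross_val_score(m, X, y, cv=5, scoring='recall').mean()
             for name, m in models.items()}
print(pd.Series(cv_recall).round(3).sort_values(ascending=False))
```

If two models are within a fold's worth of noise of each other, single-split rankings between them should not be taken too seriously.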
In [94]:
# Get feature importances for the best model
model_name = 'Tuned Random Forest'
importances = rf_tuned.feature_importances_  # use the tuned RF, not the earlier decision tree
indices = np.argsort(importances)
plt.figure(figsize=(20, 15))
plt.title(f'Feature Importance - {model_name}')
plt.barh(range(len(indices)), importances[indices], color='green', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
# Annotate each bar with its importance value
for i, idx in enumerate(indices):
    val = importances[idx]
    label = "<0.001" if val < 0.001 else f"{val:.3f}"
    plt.text(val + 0.001, i, label, va='center', fontsize=15)
plt.tight_layout()
plt.show()
Logistic Regression¶
In [95]:
# Import additional library
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
In [96]:
# Standardize the features before running logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize and train the logistic regression model
log_reg = LogisticRegression(
solver='newton-cg',
max_iter=1000,
penalty='l2', # Regularization
verbose=True, # Shows optimization progress
n_jobs=-1, # Use all CPU cores
random_state=1
)
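Fitting the scaler on the training set only, as above, is exactly right. Bundling the scaler and classifier in a `Pipeline` goes one step further: cross-validation then re-fits the scaler inside each training fold, so no test-fold statistics can leak in. A self-contained sketch on sklearn's bundled copy of this dataset (labels flipped so malignant is the positive class, as in the notebook):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # sklearn encodes benign=1; flip so malignant = 1

# The scaler is re-fit on each CV training fold automatically
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=1))
scores = cross_val_score(pipe, X, y, cv=5, scoring='recall')
print(scores.round(3), round(scores.mean(), 3))
```

The same pipeline object can be passed to `GridSearchCV`, which keeps the scaling honest during hyperparameter tuning as well.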
In [97]:
def get_metrics_score(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
In [98]:
# Fit the model to the training data
log_reg.fit(X_train_scaled, y_train)
# Predict probabilities for ROC curve
y_probs = log_reg.predict_proba(X_test_scaled)[:, 1]
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
In [99]:
def get_metrics_score(model, X_train, y_train, X_test, y_test):
    # Predict on train and test
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    # Print results
    print("Model Performance")
    print("-" * 40)
    print("{:<15} {:<10} {:<10}".format("Metric", "Train", "Test"))
    print("-" * 40)
    print("{:<15} {:<10.2f} {:<10.2f}".format("Accuracy",
          accuracy_score(y_train, y_pred_train), accuracy_score(y_test, y_pred_test)))
    print("{:<15} {:<10.2f} {:<10.2f}".format("Precision",
          precision_score(y_train, y_pred_train), precision_score(y_test, y_pred_test)))
    print("{:<15} {:<10.2f} {:<10.2f}".format("Recall",
          recall_score(y_train, y_pred_train), recall_score(y_test, y_pred_test)))
    print("{:<15} {:<10.2f} {:<10.2f}".format("F1 Score",
          f1_score(y_train, y_pred_train), f1_score(y_test, y_pred_test)))
    print("-" * 40)
In [100]:
def make_confusion_matrix(model, X_test, y_actual, labels=[0, 1]):
    # Predict using the model
    y_predict = model.predict(X_test)
    # Compute confusion matrix
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    # Create labeled DataFrame
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"]
    )
    # Prepare annotation labels (counts + percentages)
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot_labels = [f"{count}\n{percent}" for count, percent in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(cm.shape)
    # Plot the heatmap
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot_labels, fmt="", cmap="Blues")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()
    plt.show()
In [101]:
get_metrics_score(log_reg, X_train_scaled, y_train, X_test_scaled, y_test)
make_confusion_matrix(log_reg, X_test_scaled, y_test)
Model Performance
----------------------------------------
Metric          Train      Test
----------------------------------------
Accuracy        0.99       0.95
Precision       1.00       0.94
Recall          0.98       0.94
F1 Score        0.99       0.94
----------------------------------------
Create ROC Curve¶
In [102]:
# Import additional library
from sklearn.metrics import roc_curve, auc
# Ensure X_test is scaled like training data
y_probs = log_reg.predict_proba(X_test_scaled)[:, 1]
# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='red', lw=1, linestyle='--', label='Chance')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize=14)
plt.tick_params(axis='both', labelsize=12)
plt.legend(loc='lower right', fontsize=14)
plt.grid(True)
plt.tight_layout()
plt.show()
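The ROC curve is more than a summary plot: each point on it corresponds to a classification threshold, and the default 0.5 cutoff is not sacred. One common data-driven choice is the threshold that maximizes Youden's J statistic (TPR - FPR). A self-contained sketch on sklearn's bundled copy of this dataset (labels flipped so malignant is the positive class, as in the notebook; the split will not match the notebook's):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # malignant = 1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
probs = clf.predict_proba(scaler.transform(X_te))[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)
best = np.argmax(tpr - fpr)            # index maximizing Youden's J
print(round(thresholds[best], 3))      # data-driven cutoff instead of 0.5
preds = (probs >= thresholds[best]).astype(int)
```

For a cancer screen, one might instead pick the lowest threshold meeting a target recall, accepting more false positives in exchange for fewer missed malignancies.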
Sigmoid Curve¶
In [103]:
from scipy.special import expit # Numerically stable sigmoid
# Choose the feature you want to visualize
feature_name = "concave points_worst" # <-- CHANGE this to your real feature name
if feature_name not in X_train.columns:
    raise ValueError(f"'{feature_name}' is not a valid feature in X_train.")
# Get the index of the feature
feature_index = list(X_train.columns).index(feature_name)
# Get the corresponding coefficient and intercept from the trained model
coef = log_reg.coef_[0][feature_index]
intercept = log_reg.intercept_[0]
# Generate a range of values across that feature's actual (unscaled) range
x_vals = np.linspace(X_train[feature_name].min(), X_train[feature_name].max(), 300)
# Re-standardize manually to match the model's input scale
mean = X_train[feature_name].mean()
std = X_train[feature_name].std(ddof=0)  # ddof=0 matches StandardScaler's population std
x_vals_scaled = (x_vals - mean) / std
# Compute z = w*x + b and apply sigmoid
z = intercept + coef * x_vals_scaled
sigmoid_vals = expit(z)
# Plot the sigmoid function
plt.figure(figsize=(10, 6))
plt.plot(x_vals, sigmoid_vals, label='Sigmoid Curve', color='blue')
plt.axhline(0.5, color='red', linestyle='--', label='Threshold = 0.5')
plt.xlabel(f'{feature_name}', fontsize=15)
plt.ylabel('Predicted Probability (Class 1)', fontsize=15)
plt.title(f'Sigmoid Function for Feature: {feature_name}')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
# Compute the feature value at which probability = 0.5 (decision boundary)
decision_boundary_scaled = -intercept / coef
decision_boundary = decision_boundary_scaled * std + mean # convert back to original scale
# Compute curve steepness (magnitude of slope)
steepness = abs(coef)
# Generate interpretation message
print("\nInterpretation:")
print(f"- This sigmoid curve shows how the feature '{feature_name}' influences the model's prediction for class 1.")
print(f"- The decision threshold (where predicted probability = 0.5) occurs at approximately **{decision_boundary:.2f}**.")
print(f"- Feature values **below {decision_boundary:.2f}** are associated with a low predicted probability of class 1.")
print(f"- Feature values **above {decision_boundary:.2f}** are associated with a high predicted probability of class 1.")
# Interpret steepness
if steepness > 5:
    print(f"- The curve is steep, meaning '{feature_name}' is a **strong predictor** in the model.")
elif steepness > 1:
    print(f"- The curve is moderately steep, so '{feature_name}' is a **meaningful but not dominant predictor**.")
else:
    print(f"- The curve is relatively flat, so '{feature_name}' may have **limited predictive power** on its own.")
Interpretation:
- This sigmoid curve shows how the feature 'concave points_worst' influences the model's prediction for class 1.
- The decision threshold (where predicted probability = 0.5) occurs at approximately **0.17**.
- Feature values **below 0.17** are associated with a low predicted probability of class 1.
- Feature values **above 0.17** are associated with a high predicted probability of class 1.
- The curve is relatively flat, so 'concave points_worst' may have **limited predictive power** on its own.
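Another way to read a scaled logistic coefficient is through odds ratios: each one-standard-deviation increase in the feature multiplies the odds of class 1 by exp(w). A tiny sketch with a hypothetical coefficient w = 0.8 (illustrative only, not taken from the fitted model above):

```python
import numpy as np

w = 0.8  # hypothetical scaled logistic coefficient, not from this notebook's model
odds_multiplier = np.exp(w)
print(round(odds_multiplier, 2))  # each +1 SD multiplies the odds of class 1 by ~2.23
```

With strongly correlated features like these nucleus measurements, individual coefficients (and hence individual sigmoid curves) can look flat even when the feature is informative, because correlated neighbors absorb part of its effect.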
Support Vector Machines (SVM)¶
In [104]:
# Import additional required library
from sklearn.svm import SVC
# Build function to evaluate model on scaled data
def evaluate_model(name, model, X_train_scaled, y_train, X_test_scaled, y_test):
    print(f"\nPerformance: {name}")
    print("-" * 50)
    # Predict on train and test sets
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    # Compute metrics
    train_metrics = {
        'Accuracy': accuracy_score(y_train, y_pred_train),
        'Precision': precision_score(y_train, y_pred_train),
        'Recall': recall_score(y_train, y_pred_train),
        'F1 Score': f1_score(y_train, y_pred_train)
    }
    test_metrics = {
        'Accuracy': accuracy_score(y_test, y_pred_test),
        'Precision': precision_score(y_test, y_pred_test),
        'Recall': recall_score(y_test, y_pred_test),
        'F1 Score': f1_score(y_test, y_pred_test)
    }
    # Print table-style output
    print("{:<12} {:<10} {:<10}".format("Metric", "Train", "Test"))
    print("-" * 32)
    for metric in train_metrics:
        print("{:<12} {:<10.2f} {:<10.2f}".format(
            metric, train_metrics[metric], test_metrics[metric]))
    # Confusion matrix (test set only)
    cm = confusion_matrix(y_test, y_pred_test)
    total = cm.sum()
    labels = [f"{v}\n{v/total:.2%}" for v in cm.flatten()]
    labels = np.array(labels).reshape(cm.shape)
    df_cm = pd.DataFrame(cm,
                         index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])
    plt.figure(figsize=(10, 6))
    sns.heatmap(df_cm, annot=labels, fmt='', cmap='Blues', cbar=False)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.tight_layout()
    plt.show()
# Create the SVM model with a linear kernel
svm_model = SVC(kernel='linear', probability=True, random_state=0)
# Train the model
svm_model.fit(X_train_scaled, y_train)
evaluate_model("Support Vector Machine", svm_model, X_train_scaled, y_train, X_test_scaled, y_test)
Performance: Support Vector Machine
--------------------------------------------------
Metric       Train      Test
--------------------------------
Accuracy     0.99       0.96
Precision    1.00       0.97
Recall       0.97       0.92
F1 Score     0.99       0.94
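The linear kernel above is only one choice: an RBF kernel with tuned `C` and `gamma` is the usual next step for SVMs, and wrapping the scaler and SVC in a pipeline keeps cross-validation leak-free while tuning. A self-contained sketch with a small illustrative grid, on sklearn's bundled copy of this dataset (labels flipped so malignant is the positive class):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # malignant = 1, as in the notebook

# Step names from make_pipeline ('svc') prefix the grid keys
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=0)),
    {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.01, 0.1]},  # illustrative grid
    scoring='recall',
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Larger `C` fits the training data more aggressively; larger `gamma` makes the decision boundary wigglier, so both push toward overfitting when cranked up together.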
In [105]:
# Create SVM confusion matrix
def make_svm_confusion_matrix(model, X_test, y_test, labels=[0, 1]):
    # Predict using the model
    y_pred = model.predict(X_test)
    # Compute confusion matrix
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    # Create labeled DataFrame
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"]
    )
    # Prepare annotation labels (counts + percentages)
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot_labels = [f"{count}\n{percent}" for count, percent in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(cm.shape)
    # Plot the heatmap
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot_labels, fmt="", cmap="Blues")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()
    plt.show()
make_svm_confusion_matrix(svm_model, X_test_scaled, y_test)  # use scaled features, matching how the SVM was trained
Boosting Algorithms¶
In [106]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    print(f"\nPerformance: {name}")
    print("-" * 50)
    # Predict on train and test sets
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    # Compute metrics
    train_metrics = {
        'Accuracy': accuracy_score(y_train, y_pred_train),
        'Precision': precision_score(y_train, y_pred_train),
        'Recall': recall_score(y_train, y_pred_train),
        'F1 Score': f1_score(y_train, y_pred_train)
    }
    test_metrics = {
        'Accuracy': accuracy_score(y_test, y_pred_test),
        'Precision': precision_score(y_test, y_pred_test),
        'Recall': recall_score(y_test, y_pred_test),
        'F1 Score': f1_score(y_test, y_pred_test)
    }
    # Print table-style output
    print("{:<12} {:<10} {:<10}".format("Metric", "Train", "Test"))
    print("-" * 32)
    for metric in train_metrics:
        print("{:<12} {:<10.2f} {:<10.2f}".format(
            metric, train_metrics[metric], test_metrics[metric]))
    # Confusion matrix (test set only)
    cm = confusion_matrix(y_test, y_pred_test)
    total = cm.sum()
    labels = [f"{v}\n{v/total:.2%}" for v in cm.flatten()]
    labels = np.array(labels).reshape(cm.shape)
    df_cm = pd.DataFrame(cm,
                         index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])
    plt.figure(figsize=(10, 6))
    sns.heatmap(df_cm, annot=labels, fmt='', cmap='Blues', cbar=False)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.tight_layout()
    plt.show()
AdaBoost (Adaptive Boosting)¶
In [107]:
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(random_state=0)
adaboost.fit(X_train, y_train)
evaluate_model("AdaBoost", adaboost, X_train, y_train, X_test, y_test)
Performance: AdaBoost
--------------------------------------------------
Metric       Train      Test
--------------------------------
Accuracy     1.00       0.97
Precision    1.00       0.97
Recall       1.00       0.95
F1 Score     1.00       0.96
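AdaBoost builds its ensemble one weak learner at a time, so you can replay its test accuracy after each round with `staged_score` and see where extra estimators stop helping. A self-contained sketch on sklearn's bundled copy of this dataset (its own split, so numbers will differ from the notebook's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# One test-accuracy value per boosting round, in order
stage_scores = list(ada.staged_score(X_te, y_te))
print(len(stage_scores), round(stage_scores[-1], 3))
```

If the curve of staged scores plateaus early, a smaller `n_estimators` gives the same accuracy with a cheaper, faster model.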
Tuned AdaBoost¶
In [108]:
tuned_adaboost = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=2),
n_estimators=100,
learning_rate=0.8,
algorithm='SAMME.R',
random_state=0
)
tuned_adaboost.fit(X_train, y_train)
evaluate_model("Tuned AdaBoost", tuned_adaboost, X_train, y_train, X_test, y_test)
Performance: Tuned AdaBoost
--------------------------------------------------
Metric       Train      Test
--------------------------------
Accuracy     1.00       0.96
Precision    1.00       0.97
Recall       1.00       0.94
F1 Score     1.00       0.95
Gradient Boosting¶
In [109]:
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(random_state=0)
gb.fit(X_train, y_train)
evaluate_model("Gradient Boosting", gb, X_train, y_train, X_test, y_test)
Performance: Gradient Boosting
--------------------------------------------------
Metric       Train      Test
--------------------------------
Accuracy     1.00       0.96
Precision    1.00       0.97
Recall       1.00       0.92
F1 Score     1.00       0.94
Tuned Gradient Boosting¶
In [110]:
tuned_gb = GradientBoostingClassifier(
n_estimators=200,
learning_rate=0.05,
max_depth=3,
subsample=0.8,
max_features='sqrt',
random_state=0
)
tuned_gb.fit(X_train, y_train)
evaluate_model("Tuned Gradient Boosting", tuned_gb, X_train, y_train, X_test, y_test)
Performance: Tuned Gradient Boosting
--------------------------------------------------
Metric       Train      Test
--------------------------------
Accuracy     1.00       0.96
Precision    1.00       0.95
Recall       1.00       0.94
F1 Score     1.00       0.94
XGBoost¶
In [111]:
!pip install xgboost
from xgboost import XGBClassifier
Requirement already satisfied: xgboost in /opt/anaconda3/lib/python3.12/site-packages (3.0.2)
Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (from xgboost) (1.26.4)
Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.12/site-packages (from xgboost) (1.13.1)
In [112]:
xgb = XGBClassifier(
use_label_encoder=False,
eval_metric='logloss',
random_state=0
)
xgb.fit(X_train, y_train)
evaluate_model("XGBoost (Default)", xgb, X_train, y_train, X_test, y_test)
Performance: XGBoost (Default)
--------------------------------------------------
Metric       Train      Test
--------------------------------
Accuracy     1.00       0.97
Precision    1.00       0.98
Recall       1.00       0.94
F1 Score     1.00       0.96
Tuned XGBoost¶
In [113]:
tuned_xgb = XGBClassifier(
n_estimators=300,
learning_rate=0.05,
max_depth=4,
subsample=0.8,
colsample_bytree=0.8,
gamma=0.1,
reg_alpha=0.5,
reg_lambda=1,
use_label_encoder=False,
eval_metric='logloss',
random_state=0
)
tuned_xgb.fit(X_train, y_train)
evaluate_model("XGBoost (Tuned)", tuned_xgb, X_train, y_train, X_test, y_test)
Performance: XGBoost (Tuned)
--------------------------------------------------
Metric       Train      Test
--------------------------------
Accuracy     1.00       0.98
Precision    1.00       0.98
Recall       0.99       0.95
F1 Score     1.00       0.97
Create expanded comparison table¶
In [114]:
# Identify all models to compare
models = [
dtree_estimator,
dtree_tuned,
bagging_classifier,
bagging_tuned,
rf_estimator,
rf_tuned,
log_reg,
svm_model,
adaboost,
tuned_adaboost,
gb,
tuned_gb,
xgb,
tuned_xgb
]
In [115]:
# List of model names
model_names = [
'Decision Tree',
'Tuned Decision Tree',
'Bagging Classifier',
'Tuned Bagging Classifier',
'Random Forest',
'Tuned Random Forest',
'Logistic Regression',
'Support Vector Machine',
'AdaBoost',
'Tuned AdaBoost',
'Gradient Boosting',
'Tuned Gradient Boosting',
'XGBoost',
'Tuned XGBoost'
]
In [116]:
# Clear the lists
acc_train, acc_test = [], []
recall_train, recall_test = [], []
precision_train, precision_test = [], []
f1_train, f1_test = [], []
final_model_names = []
In [117]:
def get_metrics_score(model, X_train, y_train, X_test, y_test):
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    acc_train = accuracy_score(y_train, y_pred_train)
    acc_test = accuracy_score(y_test, y_pred_test)
    recall_train = recall_score(y_train, y_pred_train)
    recall_test = recall_score(y_test, y_pred_test)
    precision_train = precision_score(y_train, y_pred_train)
    precision_test = precision_score(y_test, y_pred_test)
    f1_train = f1_score(y_train, y_pred_train)
    f1_test = f1_score(y_test, y_pred_test)
    return [
        acc_train, acc_test,
        recall_train, recall_test,
        precision_train, precision_test,
        f1_train, f1_test
    ]
In [118]:
model_data_map = {
"Logistic Regression": (X_train_scaled, X_test_scaled),
"Support Vector Machine": (X_train_scaled, X_test_scaled),
# others use unscaled
}
In [119]:
# Rebuild the final name list alongside the metric lists
final_model_names = []
for model, name in zip(models, model_names):
    if model is None:
        continue
    # Use scaled data if defined for that model
    X_tr, X_te = model_data_map.get(name, (X_train, X_test))
    scores = get_metrics_score(model, X_tr, y_train, X_te, y_test)
    acc_train.append(scores[0])
    acc_test.append(scores[1])
    recall_train.append(scores[2])
    recall_test.append(scores[3])
    precision_train.append(scores[4])
    precision_test.append(scores[5])
    f1_train.append(scores[6])
    f1_test.append(scores[7])
    final_model_names.append(name)
In [120]:
print("Lengths:", len(final_model_names), len(acc_train), len(acc_test), len(f1_test))
Lengths: 14 14 14 14
In [121]:
comparison_frame = pd.DataFrame({
'Model': final_model_names,
'Train_Accuracy': acc_train,
'Test_Accuracy': acc_test,
'Train_Recall': recall_train,
'Test_Recall': recall_test,
'Train_Precision': precision_train,
'Test_Precision': precision_test,
'Train_F1': f1_train,
'Test_F1': f1_test
})
comparison_frame_sorted = comparison_frame.sort_values(by='Test_Recall', ascending=False) # Sort by most important metric
# Display the table
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', None)
comparison_frame_sorted
Out[121]:
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 |
|---|---|---|---|---|---|---|---|---|---|
| 8 | AdaBoost | 1.00 | 0.97 | 1.00 | 0.95 | 1.00 | 0.97 | 1.00 | 0.96 |
| 13 | Tuned XGBoost | 1.00 | 0.98 | 0.99 | 0.95 | 1.00 | 0.98 | 1.00 | 0.97 |
| 6 | Logistic Regression | 0.99 | 0.95 | 0.98 | 0.94 | 1.00 | 0.94 | 0.99 | 0.94 |
| 9 | Tuned AdaBoost | 1.00 | 0.96 | 1.00 | 0.94 | 1.00 | 0.97 | 1.00 | 0.95 |
| 11 | Tuned Gradient Boosting | 1.00 | 0.96 | 1.00 | 0.94 | 1.00 | 0.95 | 1.00 | 0.94 |
| 12 | XGBoost | 1.00 | 0.97 | 1.00 | 0.94 | 1.00 | 0.98 | 1.00 | 0.96 |
| 2 | Bagging Classifier | 1.00 | 0.96 | 0.99 | 0.92 | 1.00 | 0.97 | 1.00 | 0.94 |
| 3 | Tuned Bagging Classifier | 1.00 | 0.95 | 0.99 | 0.92 | 1.00 | 0.95 | 1.00 | 0.94 |
| 4 | Random Forest | 1.00 | 0.96 | 1.00 | 0.92 | 1.00 | 0.97 | 1.00 | 0.94 |
| 5 | Tuned Random Forest | 0.97 | 0.94 | 0.97 | 0.92 | 0.95 | 0.92 | 0.96 | 0.92 |
| 7 | Support Vector Machine | 0.99 | 0.96 | 0.97 | 0.92 | 1.00 | 0.97 | 0.99 | 0.94 |
| 10 | Gradient Boosting | 1.00 | 0.96 | 1.00 | 0.92 | 1.00 | 0.97 | 1.00 | 0.94 |
| 1 | Tuned Decision Tree | 0.96 | 0.92 | 0.96 | 0.89 | 0.93 | 0.89 | 0.94 | 0.89 |
| 0 | Decision Tree | 1.00 | 0.92 | 1.00 | 0.84 | 1.00 | 0.95 | 1.00 | 0.89 |